action #34783

Don't let jobs incomplete if mandatory resources are missing

Added by nicksinger almost 2 years ago. Updated 1 day ago.

Status:FeedbackStart date:12/04/2018
Priority:NormalDue date:
Assignee:mkittler% Done:

0%

Category:Feature requests
Target version:Current Sprint
Difficulty:
Duration:

Description

Motivation

It would be nice if openQA could detect missing resources (e.g. 404 on an ISO-download) on its own and report it back to the end-user/reviewer (e.g. set the job to failed and give a short explanation on the details tab).
A typical example where we, as product QA, could benefit from such a feature, are the all-packages-iso-tests: https://openqa.suse.de/tests/1611728 .

Since these tests always need ISO and ISO_1 to be available it can happen from time to time that they incomplete if one of them did not build in our build-system. These tests then show up as incomplete (which usually indicates some serious problem in the backend) and the reviewer needs to read the logs first just to see that one resource was not available.

Acceptance criteria

  • AC1: Jobs that are triggered while necessary assets are missing end up as "failed"
  • AC2: The details tab shows a corresponding test details content

Suggestions


Related issues

Related to openQA Project - action #33946: Iso image of Parent Job is downloaded even if "--skip-dep... Resolved 28/03/2018 12/11/2019
Related to openQA Project - action #28328: job was triggered trying to download HDD image but it's a... Rejected 24/11/2017
Related to openQA Project - action #12180: [webui] Prevent tests to be triggered when required asset... New 31/05/2016
Related to openQA Project - action #46742: test incompletes trying to revert to qemu snapshot auto_r... Resolved 28/01/2019 18/02/2020
Related to openQA Project - action #62420: [epic] Distinguish all types of incompletes Blocked 12/12/2018
Copied to openQA Project - action #63919: Improve handling of additional UEFI settings New 27/02/2020

History

#1 Updated by coolo almost 2 years ago

this is trickier than you might think - because backends other than qemu don't even care for ISO_1

But I guess we can have a pre-start checkup function in the backends that can create a fake result.

#2 Updated by okurz almost 2 years ago

  • Related to action #33946: Iso image of Parent Job is downloaded even if "--skip-deps" property is specified while cloning Child Job added

#3 Updated by okurz almost 2 years ago

  • Category set to Feature requests

coolo wrote:

this is trickier than you might think - because backends other than qemu don't even care for ISO_1

True, this then makes it related to #33946 . IMHO if we can ensure that jobs which do not actually need an asset do not have it registered. Then the check for required assets can be fatal.

#4 Updated by okurz about 1 year ago

  • Related to action #28328: job was triggered trying to download HDD image but it's already gone added

#5 Updated by okurz about 1 year ago

  • Related to action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) added

#6 Updated by okurz 5 months ago

  • Related to action #46742: test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache added

#7 Updated by mkittler 5 months ago

because backends other than qemu don't even care for ISO_1

So you're implying we have tests with e.g. ISO_1 being set but the specified BACKEND doesn't even make use of that file? Likely we generally shouldn't schedule tests referencing assets they don't use.


Note that the download and verification of these files is implemented within the worker in isotovideo.pm. Here log messages like [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2! are generated and such job are set to incomplete. We could easily switch that to marking jobs as failed if that makes more sense for users.

the reviewer needs to read the logs first just to see that one resource was not available.

Since the error is added to autoinst-log.txt before even starting isotovideo the log is only a few lines and easy to read. But of course we could add a "result info" field to the job to display such information in the boostrap card on the test details page.

#8 Updated by okurz 5 months ago

that's what the description already states, yes. Also I would be strict that only assets are specified that the job really requires. E.g. fail an s390x z/VM job – that never uses the ISO – when the ISO is not there but specified within the test. One can always overwrite variables, e.g. +ISO= to mark a job as not using it.

EDIT: 2019-12-13: By now more testsuites and job templates already specify +ISO= in case the ISO is not needed so I think we can be strict.

#9 Updated by okurz 4 months ago

  • Description updated (diff)
  • Status changed from New to Workable

#10 Updated by okurz 3 months ago

I was discussing with mkittler about treating 404s from cache as "failed" on the job result level and not incomplete but I failed to find a way to implement it easily given that IIUC $code is not returned, only a generic $ret from cache.

#11 Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
  • Target version set to Current Sprint

#12 Updated by mkittler 3 months ago

I've been creating a draft PR to avoid at least restarting/cloning job with missing assets: https://github.com/os-autoinst/openQA/pull/2676

When an ISO is scheduled or a single job is posted we can not easily prevent the job creation in the first place. It might be possible that an asset is created by a chained parent or that it needs to be downloaded (via the gru task).

Note that it would be generally nice if the worker could pass a "failure reason" to the web UI for such setup failures (when os-autoinst hasn't even been started yet). "Setup failure" could even be a new job result (the worker internally distinguishes this type of failure). Not sure whether the acceptance criteria to treat those jobs as regular "failed" jobs makes sense (it is likely a test/setup issue but unlikely a product bug).

The suggestion from the ticket description to display the information in the details tab could be implemented like gru download failures (a fake test module is inserted, e.g. https://openqa.opensuse.org/tests/1050658#step/GRU/1).

#13 Updated by okurz 2 months ago

  • Description updated (diff)

As we also discussed this yesterday in person I agree that "failed" might not be the best. I could not come up with anything better so I scratched AC1 without replacement.

#14 Updated by okurz 2 months ago

  • Related to action #62420: [epic] Distinguish all types of incompletes added

#15 Updated by mkittler about 1 month ago

  • PR to show at least a warning if assets are missing: https://github.com/os-autoinst/openQA/pull/2676
  • The error message and the way they are displayed have also been improved (similar to the suggestions mentioned in the ticket).
  • Not sure whether the suggestion "Report the test as failed and not incomplete" from the ticket description would make sense. After all this is a problem of the test setup and not the test itself.

#16 Updated by okurz about 1 month ago

mkittler wrote:

Deployed on o3. I have retriggered https://openqa.opensuse.org/tests/1181939# which mentions as asset opensuse-Tumbleweed-x86_64-20200220-gnome@64bit.qcow2 which seems to be available. Still I think I have seen a warning flash about a (the?) qcow2 file missing which doesn't seem to be the case. The clone ended up https://openqa.opensuse.org/tests/1182597 which couldn't work anyway because of File '/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/lib/../schedule/yast/nis/nis_server.yaml' does not exist at /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/lib/scheduler.pm line 109.

EDIT: Ok. I could reproduce the issue it seems. It's actually about the "uefi-vars" image, not the main disk image:

Warnings occurred when restarting jobs:

    Job 1181733 misses the following assets: hdd/opensuse-Tumbleweed-x86_64-20200220-Tumbleweed@64bit-uefi-vars.qcow2

Go to new job.
  • but the new job is fine.

Maybe this just needs an adjustement regarding another "hidden" asset.

  • Not sure whether the suggestion "Report the test as failed and not incomplete" from the ticket description would make sense. After all this is a problem of the test setup caused by wrong settings.

I agree.

#17 Updated by okurz about 1 month ago

The above warning about "uefi-vars" seems to show up in nearly all or all jobs that we retrigger. You made a good choice to make it a non-fatal warning first. I suggest we try to get rid of this warning first as it is not blocking restarts of jobs.

#18 Updated by mkittler about 1 month ago

All assets about that job have already been cleaned up (also the iso image). Therefore I've cloned a different job to reproduce it locally.

I get the same warning. Apparently it expects the asset to be called hdd/opensuse-Tumbleweed-x86_64-20200222-textmode@64bit-uefi-vars.qcow2 but it is actually just called hdd/opensuse-Tumbleweed-x86_64-20200222-textmode@64bit.qcow2 (the -uefi-vars suffix is missing). The only job setting which contains opensuse-Tumbleweed-x86_64-20200222-textmode@64bit-uefi-vars.qcow2 is UEFI_PFLASH_VARS so likely this setting is where the expectation for this asset to be present comes from.

#19 Updated by okurz about 1 month ago

…@64bit-uefi-vars.qcow2 and …@64bit.qcow2 are not the same assets. As you mentioned in chat, it's an additional asset but one that commonly does not exist and is also not a problem for tests. It has special handling so maybe we can regard it as "hidden" asset as well automatically?

EDIT: Discussed with mkittler. There is no warning when we use the "uefi" machine variant which references absolute paths to the qemu uefi vars+code images. Problem appears when UEFI_PFLASH_VARS is specified in settings of a job with a relative path, e.g. https://openqa.opensuse.org/tests/1181733#settings shows "UEFI_PFLASH_VARS=opensuse-Tumbleweed-x86_64-20200220-Tumbleweed@64bit-uefi-vars.qcow2" but this file does not exist and is not mentioned in autoinst-log.txt . https://openqa.opensuse.org/admin/test_suites specifies this variable also in test suites like "create_hdd_textmode" and "boot_to_snapshot" which at least for "openSUSE Tumbleweed" are only scheduled for 64bit machine, i.e. need no UEFI. Why do we even have these variables? I tend towards treating the existing warning as an error instead and try to get rid of this variable if possible.

#20 Updated by mkittler about 1 month ago

I've just tested the job locally again without UEFI_PFLASH_VARS and PUBLISH_PFLASH_VARS and it was able to boot fine.

Note that it failed later on a zypper command (without previous warning):

Retrieving package bc-1.07.1-3.1.x86_64 (1/69), 109.8 KiB (218.7 KiB unpacked)
Media source 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20200222' does not contain the desired medium
History:
 - File '/media.1/media' not found on medium 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20200222'

This shows the current limitation that comes from excluding hidden assets from the check. However, I don't think there's an easy way to deal with that in a better way.

I've also tested the full chain of another job locally without UEFI_PFLASH_VARS and without PUBLISH_PFLASH_VARS and the tests passed. So for non-UEFI tests these variables could be removed.

#21 Updated by okurz about 1 month ago

Discussed with mkittler also today. In short: We agreed to hide the warning about UEFI_PFLASH_VARS similar to how we handle UEFI_PFLASH_VARS already in many other locations of os-autoinst as well as openQA. We try to improve the overall situation regarding UEFI by trying to handle the UEFI assets implicitly, i.e. hide from user, but more scalable and easier to manage for users, e.g. users do not need to specify these files in any test suite. Maybe we can just look for the corresponding "…-uefi-vars.qcow2" file whenever UEFI=1 is set. Handled in #63919

@mkittler as a fast fix please make sure the current warning doesn't show for uefi variables as simple as possible – hacky approach allowed ;)

#22 Updated by okurz about 1 month ago

  • Copied to action #63919: Improve handling of additional UEFI settings added

#23 Updated by mkittler 11 days ago

The PR to avoid the warning for UEFI_PFLASH_VARS has been merged: https://github.com/os-autoinst/openQA/pull/2807

I stated options to continue in a PR comment: https://github.com/os-autoinst/openQA/pull/2676#issuecomment-578792752

@okurz proposed the following: https://github.com/os-autoinst/openQA/pull/2676#issuecomment-601553844

I think it makes sense if we suggest restarting the parent job and not just hide the restart button. Maybe we could also not make it a "hard prevent". So one will see the error but still has the option to click on a "Restart parent" or "Force restart" button.

#24 Updated by okurz 10 days ago

mkittler wrote:

[…]

I think it makes sense if we suggest restarting the parent job and not just hide the restart button. Maybe we could also not make it a "hard prevent". So one will see the error but still has the option to click on a "Restart parent" or "Force restart" button.

How about we display the retrigger button crossed out with a message next to it or a hover text explaining "Job is missing mandatory assets for retriggering. Ensure to provide mandatory assets and/or force retriggering over API if necessary". And if a parent exists: "Job is missing mandatory assets for retriggering. You may try to retrigger the parent job that should create the assets and will implicitly retrigger this job as well."

#25 Updated by mkittler 5 days ago

Sounds good

#26 Updated by mkittler 4 days ago

When I think about it, having the restart button immediately crossed out means that the check for missing assets needs already be done when loading the details page. That sounds a little bit wasteful considering we need to parse the job settings again for that check and that the page already needs long enough to load anyways. So I rather show that error only on an attempt to click on the restart button.

#27 Updated by okurz 4 days ago

yeah, sorry, forgot about that. I would use the existing approach with a flash message but instead of warning make it an error and actually not trigger a job which we are now sure can't succeed.

EDIT: So will you try to continue here?

#28 Updated by mkittler 3 days ago

  • Status changed from In Progress to Feedback

#29 Updated by mkittler 1 day ago

I've just tried to restart a job locally with has an empty ISO variable. With my latest change it can not be restarted because the asset iso/ does not exists. Not sure through which experiment I've got this job but maybe it was like this on the production instance and I've just cloned it. Should we support this by adding an exception for empty asset names?

#30 Updated by Xiaojing_liu 1 day ago

mkittler wrote:

I've just tried to restart a job locally with has an empty ISO variable. With my latest change it can not be restarted because the asset iso/ does not exists. Not sure through which experiment I've got this job but maybe it was like this on the production instance and I've just cloned it. Should we support this by adding an exception for empty asset names?

I am not sure this situation in OSD, but in o3, there are some test suites have an empty ISO=. Here is a related ticket: https://progress.opensuse.org/issues/59394

And here is a job with ISO= in settings. https://openqa.opensuse.org/tests/1219118.
Do you mean this situation?

Also available in: Atom PDF