action #34783
closedDon't let jobs incomplete if mandatory resources are missing
0%
Description
Motivation¶
It would be nice if openQA could detect missing resources (e.g. 404 on an ISO-download) on its own and report it back to the end-user/reviewer (e.g. set the job to failed and give a short explanation on the details tab).
A typical example where we, as product QA, could benefit from such a feature, are the all-packages-iso-tests: https://openqa.suse.de/tests/1611728 .
Since these tests always need ISO and ISO_1 to be available it can happen from time to time that they incomplete if one of them did not build in our build-system. These tests then show up as incomplete (which usually indicates some serious problem in the backend) and the reviewer needs to read the logs first just to see that one resource was not available.
Acceptance criteria¶
- AC1: Jobs that are triggered while necessary assets are missing end up as "failed"
- AC2: The details tab shows a corresponding test details content
Suggestions¶
- Check the status-code on asset downloads
- Display 404-errors with a (maybe customizable) message in the details-tab, e.g. like from "isosize" in https://openqa.opensuse.org/tests/1112060#step/isosize/1
- Report the test as failed and not incomplete
Updated by coolo almost 7 years ago
this is trickier than you might think - because backends other than qemu don't even care for ISO_1
But I guess we can have a pre-start checkup function in the backends that can create a fake result.
Updated by okurz almost 7 years ago
- Related to action #33946: Iso image of Parent Job is downloaded even if "--skip-deps" property is specified while cloning Child Job added
Updated by okurz almost 7 years ago
- Category set to Feature requests
coolo wrote:
this is trickier than you might think - because backends other than qemu don't even care for ISO_1
True, this then makes it related to #33946 . IMHO if we can ensure that jobs which do not actually need an asset do not have it registered. Then the check for required assets can be fatal.
Updated by okurz about 6 years ago
- Related to action #28328: job was triggered trying to download HDD image but it's already gone added
Updated by okurz about 6 years ago
- Related to action #12180: [webui] Prevent tests to be triggered when required assets are not present (anymore) added
Updated by okurz about 5 years ago
- Related to action #46742: test incompletes trying to revert to qemu snapshot auto_review:"Could not open backing file: Could not open .*.qcow.*No such file or directory", likely premature deletion of files from cache added
Updated by mkittler about 5 years ago
because backends other than qemu don't even care for ISO_1
So you're implying we have tests with e.g. ISO_1
being set but the specified BACKEND
doesn't even make use of that file? Likely we generally shouldn't schedule tests referencing assets they don't use.
Note that the download and verification of these files is implemented within the worker in isotovideo.pm
. Here log messages like [error] [pid:26949] HDD_1 handling Cannot find HDD_1 asset hdd/SLES-15-SP1-s390x-minimal_installed_for_LTP.qcow2!
are generated and such job are set to incomplete. We could easily switch that to marking jobs as failed if that makes more sense for users.
the reviewer needs to read the logs first just to see that one resource was not available.
Since the error is added to autoinst-log.txt
before even starting isotovideo the log is only a few lines and easy to read. But of course we could add a "result info" field to the job to display such information in the boostrap card on the test details page.
Updated by okurz about 5 years ago
that's what the description already states, yes. Also I would be strict that only assets are specified that the job really requires. E.g. fail an s390x z/VM job – that never uses the ISO – when the ISO is not there but specified within the test. One can always overwrite variables, e.g. +ISO=
to mark a job as not using it.
EDIT: 2019-12-13: By now more testsuites and job templates already specify +ISO=
in case the ISO is not needed so I think we can be strict.
Updated by okurz about 5 years ago
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 5 years ago
I was discussing with mkittler about treating 404s from cache as "failed" on the job result level and not incomplete but I failed to find a way to implement it easily given that IIUC $code is not returned, only a generic $ret from cache.
Updated by mkittler about 5 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Target version set to Current Sprint
Updated by mkittler about 5 years ago
I've been creating a draft PR to avoid at least restarting/cloning job with missing assets: https://github.com/os-autoinst/openQA/pull/2676
When an ISO is scheduled or a single job is posted we can not easily prevent the job creation in the first place. It might be possible that an asset is created by a chained parent or that it needs to be downloaded (via the gru task).
Note that it would be generally nice if the worker could pass a "failure reason" to the web UI for such setup failures (when os-autoinst hasn't even been started yet). "Setup failure" could even be a new job result (the worker internally distinguishes this type of failure). Not sure whether the acceptance criteria to treat those jobs as regular "failed" jobs makes sense (it is likely a test/setup issue but unlikely a product bug).
The suggestion from the ticket description to display the information in the details tab could be implemented like gru download failures (a fake test module is inserted, e.g. https://openqa.opensuse.org/tests/1050658#step/GRU/1).
Updated by okurz about 5 years ago
- Description updated (diff)
As we also discussed this yesterday in person I agree that "failed" might not be the best. I could not come up with anything better so I scratched AC1 without replacement.
Updated by okurz about 5 years ago
- Related to coordination #62420: [epic] Distinguish all types of incompletes added
Updated by mkittler almost 5 years ago
- PR to show at least a warning if assets are missing: https://github.com/os-autoinst/openQA/pull/2676
- The error message and the way they are displayed have also been improved (similar to the suggestions mentioned in the ticket).
- Not sure whether the suggestion "Report the test as failed and not incomplete" from the ticket description would make sense. After all this is a problem of the test setup and not the test itself.
Updated by okurz almost 5 years ago
mkittler wrote:
- PR to show at least a warning if assets are missing: https://github.com/os-autoinst/openQA/pull/2676
- The error message and the way they are displayed have also been improved (similar to the suggestions mentioned in the ticket).
Deployed on o3. I have retriggered https://openqa.opensuse.org/tests/1181939# which mentions as asset opensuse-Tumbleweed-x86_64-20200220-gnome@64bit.qcow2 which seems to be available. Still I think I have seen a warning flash about a (the?) qcow2 file missing which doesn't seem to be the case. The clone ended up https://openqa.opensuse.org/tests/1182597 which couldn't work anyway because of File '/var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/lib/../schedule/yast/nis/nis_server.yaml' does not exist at /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/lib/scheduler.pm line 109.
EDIT: Ok. I could reproduce the issue it seems. It's actually about the "uefi-vars" image, not the main disk image:
- Retriggered https://openqa.opensuse.org/tests/1181733
- saw warning
Warnings occurred when restarting jobs:
Job 1181733 misses the following assets: hdd/opensuse-Tumbleweed-x86_64-20200220-Tumbleweed@64bit-uefi-vars.qcow2
Go to new job.
- but the new job is fine.
Maybe this just needs an adjustement regarding another "hidden" asset.
- Not sure whether the suggestion "Report the test as failed and not incomplete" from the ticket description would make sense. After all this is a problem of the test setup caused by wrong settings.
I agree.
Updated by okurz almost 5 years ago
The above warning about "uefi-vars" seems to show up in nearly all or all jobs that we retrigger. You made a good choice to make it a non-fatal warning first. I suggest we try to get rid of this warning first as it is not blocking restarts of jobs.
Updated by mkittler almost 5 years ago
All assets about that job have already been cleaned up (also the iso image). Therefore I've cloned a different job to reproduce it locally.
I get the same warning. Apparently it expects the asset to be called hdd/opensuse-Tumbleweed-x86_64-20200222-textmode@64bit-uefi-vars.qcow2
but it is actually just called hdd/opensuse-Tumbleweed-x86_64-20200222-textmode@64bit.qcow2
(the -uefi-vars
suffix is missing). The only job setting which contains opensuse-Tumbleweed-x86_64-20200222-textmode@64bit-uefi-vars.qcow2
is UEFI_PFLASH_VARS
so likely this setting is where the expectation for this asset to be present comes from.
Updated by okurz almost 5 years ago
…@64bit-uefi-vars.qcow2 and …@64bit.qcow2 are not the same assets. As you mentioned in chat, it's an additional asset but one that commonly does not exist and is also not a problem for tests. It has special handling so maybe we can regard it as "hidden" asset as well automatically?
EDIT: Discussed with mkittler. There is no warning when we use the "uefi" machine variant which references absolute paths to the qemu uefi vars+code images. Problem appears when UEFI_PFLASH_VARS is specified in settings of a job with a relative path, e.g. https://openqa.opensuse.org/tests/1181733#settings shows "UEFI_PFLASH_VARS=opensuse-Tumbleweed-x86_64-20200220-Tumbleweed@64bit-uefi-vars.qcow2" but this file does not exist and is not mentioned in autoinst-log.txt . https://openqa.opensuse.org/admin/test_suites specifies this variable also in test suites like "create_hdd_textmode" and "boot_to_snapshot" which at least for "openSUSE Tumbleweed" are only scheduled for 64bit machine, i.e. need no UEFI. Why do we even have these variables? I tend towards treating the existing warning as an error instead and try to get rid of this variable if possible.
Updated by mkittler almost 5 years ago
I've just tested the job locally again without UEFI_PFLASH_VARS
and PUBLISH_PFLASH_VARS
and it was able to boot fine.
Note that it failed later on a zypper command (without previous warning):
Retrieving package bc-1.07.1-3.1.x86_64 (1/69), 109.8 KiB (218.7 KiB unpacked)
Media source 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20200222' does not contain the desired medium
History:
- File '/media.1/media' not found on medium 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20200222'
This shows the current limitation that comes from excluding hidden assets from the check. However, I don't think there's an easy way to deal with that in a better way.
I've also tested the full chain of another job locally without UEFI_PFLASH_VARS and without PUBLISH_PFLASH_VARS and the tests passed. So for non-UEFI tests these variables could be removed.
Updated by okurz almost 5 years ago
Discussed with mkittler also today. In short: We agreed to hide the warning about UEFI_PFLASH_VARS similar to how we handle UEFI_PFLASH_VARS already in many other locations of os-autoinst as well as openQA. We try to improve the overall situation regarding UEFI by trying to handle the UEFI assets implicitly, i.e. hide from user, but more scalable and easier to manage for users, e.g. users do not need to specify these files in any test suite. Maybe we can just look for the corresponding "…-uefi-vars.qcow2" file whenever UEFI=1 is set. Handled in #63919
@mkittler as a fast fix please make sure the current warning doesn't show for uefi variables as simple as possible – hacky approach allowed ;)
Updated by okurz almost 5 years ago
- Copied to action #63919: Improve handling of additional UEFI settings added
Updated by mkittler almost 5 years ago
The PR to avoid the warning for UEFI_PFLASH_VARS
has been merged: https://github.com/os-autoinst/openQA/pull/2807
I stated options to continue in a PR comment: https://github.com/os-autoinst/openQA/pull/2676#issuecomment-578792752
@okurz proposed the following: https://github.com/os-autoinst/openQA/pull/2676#issuecomment-601553844
I think it makes sense if we suggest restarting the parent job and not just hide the restart button. Maybe we could also not make it a "hard prevent". So one will see the error but still has the option to click on a "Restart parent" or "Force restart" button.
Updated by okurz almost 5 years ago
mkittler wrote:
[…]
I think it makes sense if we suggest restarting the parent job and not just hide the restart button. Maybe we could also not make it a "hard prevent". So one will see the error but still has the option to click on a "Restart parent" or "Force restart" button.
How about we display the retrigger button crossed out with a message next to it or a hover text explaining "Job is missing mandatory assets for retriggering. Ensure to provide mandatory assets and/or force retriggering over API if necessary". And if a parent exists: "Job is missing mandatory assets for retriggering. You may try to retrigger the parent job that should create the assets and will implicitly retrigger this job as well."
Updated by mkittler almost 5 years ago
When I think about it, having the restart button immediately crossed out means that the check for missing assets needs already be done when loading the details page. That sounds a little bit wasteful considering we need to parse the job settings again for that check and that the page already needs long enough to load anyways. So I rather show that error only on an attempt to click on the restart button.
Updated by okurz almost 5 years ago
yeah, sorry, forgot about that. I would use the existing approach with a flash message but instead of warning make it an error and actually not trigger a job which we are now sure can't succeed.
EDIT: So will you try to continue here?
Updated by mkittler almost 5 years ago
- Status changed from In Progress to Feedback
PR has been merged: https://github.com/os-autoinst/openQA/pull/2886
Updated by mkittler almost 5 years ago
I've just tried to restart a job locally with has an empty ISO
variable. With my latest change it can not be restarted because the asset iso/
does not exists. Not sure through which experiment I've got this job but maybe it was like this on the production instance and I've just cloned it. Should we support this by adding an exception for empty asset names?
Updated by Xiaojing_liu almost 5 years ago
mkittler wrote:
I've just tried to restart a job locally with has an empty
ISO
variable. With my latest change it can not be restarted because the assetiso/
does not exists. Not sure through which experiment I've got this job but maybe it was like this on the production instance and I've just cloned it. Should we support this by adding an exception for empty asset names?
I am not sure this situation in OSD, but in o3, there are some test suites have an empty ISO=
. Here is a related ticket: https://progress.opensuse.org/issues/59394
And here is a job with ISO=
in settings. https://openqa.opensuse.org/tests/1219118.
Do you mean this situation?
Updated by mkittler almost 5 years ago
PR for allowing empty assets: https://github.com/os-autoinst/openQA/pull/2915
Updated by mkittler almost 5 years ago
- Status changed from Feedback to Resolved
- Target version deleted (
Current Sprint)
The PR has been merged. Otherwise I'm not aware of something that is missing. We could add a "force restart" button directly on the web UI (instead of letting people use the API) if that is wanted.
Feel free to reopen the ticket if something is missing.