action #163781
closed
Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S
Description
https://progress.opensuse.org/issues/163781
Observation
A few kernel jobs have failed during the upload phase with a rather nondescript reason: "api failure". As a result, there is neither an autoinst-log.txt nor a worker-log.txt.
https://openqa.suse.de/tests/14897579
https://openqa.suse.de/tests/14897580
https://openqa.suse.de/tests/14895759
Acceptance criteria
- AC1: No jobs can fail with unspecified reason "api failure" without more details
- AC2: API failures are still handled and shown via the reason field
Suggestions
- Maybe just consider rephrasing the generic error message in the ternary return in https://github.com/os-autoinst/openQA/blob/b24c267195fc746c17cededd84e9db4789fd6c67/lib/OpenQA/Worker/Job.pm#L576
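To illustrate the idea behind the suggestion above, here is a minimal sketch of a reason string that never ends at a bare "api failure". It is written in Python purely for illustration (openQA itself is Perl), it does not reflect the actual code in Job.pm, and the function name and its parameters are hypothetical.

```python
from typing import Optional


def format_api_failure_reason(
    status_code: Optional[int] = None,
    message: Optional[str] = None,
    url: Optional[str] = None,
) -> str:
    """Build a job reason that always carries some detail after 'api failure'.

    Hypothetical helper: the name and all parameters are assumptions made
    for this sketch, not part of openQA.
    """
    details = []
    if status_code is not None:
        details.append(f"HTTP {status_code}")
    # Ignore messages that are empty or whitespace only
    if message and message.strip():
        details.append(message.strip())
    if url:
        details.append(f"while requesting {url}")
    # Even without any information, say so explicitly instead of
    # returning the bare generic reason
    if not details:
        details.append("no further details available")
    return "api failure: " + ", ".join(details)
```

With this shape the reason field still starts with "api failure" (AC2), while the generic fallback is at least explicit about the missing details (AC1).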
Updated by livdywan 5 months ago
- Is duplicate of action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S added
Updated by nicksinger 4 months ago
- Status changed from New to Resolved
I validated that the openQA changes are deployed and applied my config change manually (including restarting services) for now, until our pipelines work again. So far we have not seen the new error message, which is expected and good. We discussed that this should be sufficient for now and that other alerts (e.g. number of new incomplete jobs) should alert us if the situation gets worse.
Updated by nicksinger 4 months ago
- Is duplicate of deleted (action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S)
Updated by nicksinger 4 months ago
- Related to action #162038: No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S added
Updated by nicksinger 4 months ago
- Status changed from Resolved to New
nicksinger wrote in #note-4:
I validated that the openQA changes are deployed and applied my config change manually (including restarting services) for now, until our pipelines work again. So far we have not seen the new error message, which is expected and good. We discussed that this should be sufficient for now and that other alerts (e.g. number of new incomplete jobs) should alert us if the situation gets worse.
It seems progress/Redmine took my last comment from the other ticket (https://progress.opensuse.org/issues/162038) and applied it here as well. That comment obviously does not change anything in this ticket, so I am reopening it.
Updated by nicksinger 4 months ago
- Related to action #164418: Distinguish "timestamp mismatch" from cases where webUI is slow or where clocks are really differing added
Updated by tinita 4 months ago
- Subject changed from Jobs randomly fail with unspecified "api failure", there should be more details in the error message to Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler 2 months ago
- Status changed from Feedback to Resolved
With the PR merged I don't think we'll see jobs with just "api failure" anymore. If I missed cases we can reopen the ticket. I cannot check the jobs mentioned in the ticket description specifically because they now return 404.