action #124146
closed
[alert] Incomplete jobs (not restarted) of last 24h
Added by okurz almost 2 years ago.
Updated almost 2 years ago.
Description
Observation
See https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=17 since 2023-02-08 08:00
Acceptance criteria
- AC1: Incomplete jobs alerts are not seen on Grafana
Suggestions
- The incompletes on broken downloads are likely what's expected by design
- Alerts were triggered because we got too many incompletes
- The asset never existed in the first place?
- Maybe a clean-up deleted an asset too fast (unlikely), this should be possible to find out from the asset table
- Can we detect missing assets when scheduling a job?
- Leave it in scheduled? Show a message "Asset Xyz missing"
- Reach out to whoever is scheduling jobs without valid assets
- Research whether the missing link to the parent job is actually an openQA bug that's making the private asset inaccessible
I can't establish an SSH connection to OSD. The web UI on the other hand responded promptly.
I've also noticed earlier today problems with the SSH connection to OSD. It took very long to connect and even then it was very slow. At some point it was better again but now I cannot even establish the connection anymore. Maybe these performance problems are actually related to why we have many incomplete jobs. Unfortunately I can't investigate this any further without access to the database. Ok, it works now after waiting 3 minutes or so - just before I wanted to submit this comment.
Almost all of the incompletes were download failures like https://openqa.suse.de/tests/10451514 and https://openqa.suse.de/tests/10451519 ¹. I first thought that a 404 error shouldn't actually lead to an incomplete here, but I was confusing this with https://github.com/os-autoinst/openQA/pull/4844 (which was about Minion jobs for the downloads on the web UI). So supposedly it works as intended, assuming the assets were really just wrongly specified (all assets I tried really don't exist).
There were also some incompletes because the cache queue was full: https://openqa.suse.de/tests/10449763
For now I have silenced the alert (see https://monitor.qa.suse.de/alerting/silence/06775c82-cefa-4db0-a3ac-f4734410a31d/edit?alertmanager=grafana).
¹ See `select count(id), array_agg(id), reason from jobs where t_finished >= '2023-02-07T12:00:00' and t_finished <= '2023-02-09T12:00:00' and result = 'incomplete' and clone_id is null group by reason order by count(id) desc;`
- Description updated (diff)
Leaving it in New for now since we'd like other people's input on how to potentially improve this on the openQA side of things.
What we'll do anyway is try to decrease the priority by getting the scheduled jobs fixed.
- Description updated (diff)
It seems like the job is cloned without being linked to the original parent which contains a private asset. So maybe it's a bug after all.
We have different issues (the most frequently occurring issue is listed first, nested points are ideas for improvement):
1. `openqa-investigate` doesn't take private assets into account: The original jobs (before any retry by openQA itself) already lack their chained parent dependency (e.g. https://openqa.suse.de/tests/10451500). They have been created by `openqa-investigate`. This script apparently blindly re-triggers jobs without checking whether they require a private asset generated by a chained parent job (the job that is cloned by `openqa-investigate` has its chained parent as expected). So this is likely something we need to improve in `openqa-investigate`. It should either:
- Determine whether a parent job has generated a private asset and skip the investigation completely (we don't support this case).
- Determine whether a parent job has generated a private asset and clone the parent as well (which leads to many duplicate and unnecessarily cloned jobs).
- Determine whether a parent job has generated a private asset and adjust all asset-related job settings when cloning the investigation job so the private assets can be found. Either:
- Prefix the asset name explicitly in all relevant job settings specifying assets. (Maybe this also needs a slight adjustment on the openQA-side.)
- Specify the parent job ID via `_START_AFTER` when cloning the job. Then the cloned job will reference the parent job of the original one (without re-triggering the parent). This way openQA will be able to make the private asset available.
- The side-effect is that the investigation jobs will show up in the original dependency tree (creating one big dependency tree).
- This approach will not work when the investigation jobs are triggered on a different instance than the original jobs.
2. `openqa-investigate` triggers jobs when the asset is long gone: For instance, https://openqa.suse.de/tests/10449641 was triggered yesterday for the 3-month-old (and already archived) job https://openqa.suse.de/tests/9882097. If it had re-triggered the parent job to re-create the asset, it might even have worked. However, it is questionable why this job was triggered in the first place after such a long delay. Almost all jobs I could find via the mentioned SQL query are cases like this (besides 1.).
- Skip investigating jobs that are too old.
- Skip investigating if not all assets are present anymore.
- Clone the parent job as necessary (similarly to 1.2).
3. Other errors, e.g. incompletes due to syntax errors or because the backend died. Those aren't that many compared to 1 and 2.
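The improvement ideas listed for issues 1 and 2 boil down to a small decision made before an investigation job is cloned. The following is only a sketch of that decision in Python, not code from `openqa-investigate`. All names here (`JobInfo`, `Action`, `decide`) and the 14-day age limit are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List


class Action(Enum):
    SKIP_TOO_OLD = "skip: job is too old"
    SKIP_MISSING_ASSETS = "skip: assets are gone"
    CLONE_WITH_PARENT = "clone job together with its parent"
    CLONE = "clone job only"


@dataclass
class JobInfo:
    # Hypothetical summary of what the script would need to know about a job.
    finished: datetime
    missing_assets: List[str] = field(default_factory=list)
    parent_creates_missing_assets: bool = False


def decide(job: JobInfo, now: datetime,
           max_age: timedelta = timedelta(days=14)) -> Action:
    """Pick an action for one job; max_age is an assumed, arbitrary limit."""
    if now - job.finished > max_age:
        return Action.SKIP_TOO_OLD
    if job.missing_assets:
        # A parent that generates the asset can re-create it when cloned too.
        if job.parent_creates_missing_assets:
            return Action.CLONE_WITH_PARENT
        return Action.SKIP_MISSING_ASSETS
    return Action.CLONE
```

With these assumed rules, a case like the 3-month-old archived job mentioned above would be skipped for its age before any asset lookup happens.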
- Status changed from New to Feedback
- Priority changed from Immediate to High
The alert is back to good, and this is mostly about jobs that people don't care so much about, so the urgency has been handled.
Talked about it.
1. `openqa-investigate` uses `openqa-clone-job`. `openqa-clone-job` checks if assets are there when trying to download. But when we call `openqa-clone-job … --within-instance` there is no download attempt, hence the invocation never fails. `openqa-clone-job` already knows the option `--ignore-missing-assets`. We should add an additional mode `--check-assets` or similar that sends a HEAD request to check for the presence of assets regardless of whether `--skip-download` is given.
2. If the potentially recursive asset check finds out that even the parent(s) do not have the necessary root assets anymore, e.g. the original Tumbleweed DVD ISO file, then report that explicitly to the user in the output of `openqa-investigate` that ends up in openQA comments.
3. If the asset check finds out that an asset is missing but a parent should be able to create it, then `openqa-clone-job` should include the asset-generating parent in the clone call, either on request or automatically when trying to be "smart".
4. `openqa-clone-job` should be smart enough to handle private assets for the option `--within-instance`, as we should trust that the private asset actually exists there.
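The HEAD-based presence check proposed here could look roughly like the sketch below. It is written in Python with only the standard library and is not taken from `openqa-clone-job`. The function names, the injectable `opener` parameter and the example URLs are assumptions for illustration, and how asset URLs are built for a given job is left out.

```python
import urllib.error
import urllib.request


def asset_exists(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request for the asset URL succeeds, False on 404."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Other HTTP errors (e.g. 500) say nothing about presence: re-raise.
        raise


def missing_assets(urls, opener=urllib.request.urlopen):
    """Return the subset of asset URLs that do not exist, in input order."""
    return [url for url in urls if not asset_exists(url, opener=opener)]


if __name__ == "__main__":
    # Placeholder URLs; replace with real asset URLs of the job to check.
    for url in missing_assets(["https://openqa.example.invalid/assets/hdd/a.qcow2"]):
        print(f"Asset missing: {url}")
```

Because no body is downloaded, such a check would stay cheap even with `--within-instance`, and its result could be reported in the comment that `openqa-investigate` writes.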
Please put those four ideas into new feature request tickets in "future" or just add to #65271 at least for 2.-4. as they are a single sentence for now.
EDIT: Ok, I just put 2.-4. into #65271
- Copied to coordination #124394: [epic] openqa-clone-job should check for missing assets when explicitly selected by option or by default on "--within-instance" added
- Status changed from Feedback to Resolved
I put the rest into #124394 for the feature request. The alert was properly handled and we are green again.