action #96684
closed coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance
coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M
Description
Motivation
If jobs run into MAX_SETUP_TIME (like we've seen in #96557) or are otherwise cancelled, the Minion jobs for asset downloads are not cancelled. That means the worker is unlikely to get out of the situation of being overloaded with too many asset download tasks on its own. Stopping inactive (or even active) Minion jobs for asset downloads when the related openQA jobs have been cancelled would help in that situation.
Acceptance criteria
- AC1: Inactive (or even active) Minion jobs are cancelled if the related openQA job is cancelled.
- AC2: A Minion job can be responsible for multiple openQA jobs (if they share the same assets). This must still work, so the cancellation from AC1 should only happen if no other openQA job requires the Minion job.
- AC3: No partial files (or even stale data in the database) are left behind.
Suggestions
- Cache service downloads are deduplicated, so make sure no downloads are cancelled that are still required by other openQA jobs on the same worker (might require a new SQLite table to keep track of cancelled jobs)
- Increase or remove the cache service backlog limit once download cancellation is implemented
- Use Minion's built-in features to terminate active jobs properly (see the sketch after this list)
- Ensure jobs being terminated shut down gracefully
- Make sure not to corrupt the SQLite database by terminating the job too aggressively
- Maybe use USR1/USR2 signals to run custom termination code in the job
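A minimal sketch of how that cancellation could look, using Minion's built-in kill remote control command to signal active jobs; the helper and the ref-counting check are hypothetical, not existing openQA code:

use Mojo::Base -signatures;

# Hypothetical helper: cancel the asset download Minion job $job_id unless
# another openQA job on this worker still needs it (AC2).
sub cancel_download_job ($minion, $job_id) {
    return undef if other_openqa_jobs_require($job_id);    # hypothetical ref-counting check
    my $job = $minion->job($job_id) or return undef;
    my $state = $job->info->{state};
    # inactive jobs can simply be removed from the queue
    return $job->remove if $state eq 'inactive';
    # active jobs need to be signalled via Minion's remote control
    $minion->broadcast('kill', ['USR1'], [$job_id]) if $state eq 'active';
}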
Updated by mkittler over 3 years ago
- Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
Updated by okurz over 1 year ago
- Category set to Feature requests
- Priority changed from Normal to High
- Target version changed from future to Ready
I don't know if this really fixes the problem mentioned in the epic. This should be revisited.
Updated by mkittler over 1 year ago
- Description updated (diff)
- Difficulty set to hard
Updated by okurz over 1 year ago
- Priority changed from High to Low
- Target version changed from Ready to future
We are looking into #125276 first
Updated by mkittler over 1 year ago
- Subject changed from Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) to Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Copied to action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M added
Updated by okurz about 1 year ago
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 year ago
- Target version changed from Tools - Next to Ready
Updated by okurz about 1 year ago
- Related to action #137300: [FIRING:1] (Incomplete jobs (not restarted) of last 24h alert Salt size:M added
Updated by mkittler 12 months ago
- Status changed from In Progress to Workable
I worked on this yesterday and it is generally not too difficult. However, making the involved ref-counting race-free is definitely not that easy. I pushed the changes I have so far to GitHub: https://github.com/os-autoinst/openQA/compare/master...Martchus:openQA:withdraw?expand=1
@kraih also pointed out that queries like this would be efficient:
$minion->enqueue('foo', [...], {notes => {belongs_to_job_123 => 1}});    # tag the Minion job with the ID of the openQA job it belongs to
my $results = $minion->jobs({states => ['active'], notes => ['belongs_to_job_123']});    # find active Minion jobs by that note
I'm just not sure how to apply it here because finding the right job to cancel is not the problem. The problem is deciding whether to cancel it at all, that is, whether no other openQA jobs still use it.
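For illustration, that decision could be based on an own ref-counting table, roughly like the following sketch (Mojo::SQLite as the cache service already uses; table name and schema are invented, $minion, $minion_id and $openqa_id are assumed to be in scope, and the races this must guard against are exactly what the next comment is about):

use Mojo::SQLite;

my $db = Mojo::SQLite->new('sqlite:cache.sqlite3')->db;    # path invented
# one row per openQA job still waiting on a download Minion job
$db->query('create table if not exists download_refs (minion_id integer, openqa_id integer, primary key (minion_id, openqa_id))');
# when an openQA job is cancelled: drop its reference and cancel the Minion
# job only if it was the last one, doing both within a single transaction
my $tx = $db->begin;
$db->query('delete from download_refs where minion_id = ? and openqa_id = ?', $minion_id, $openqa_id);
my $remaining = $db->query('select count(*) from download_refs where minion_id = ?', $minion_id)->array->[0];
$tx->commit;
cancel_download_job($minion, $minion_id) unless $remaining;    # hypothetical helper from the sketch in the description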
Updated by mkittler 12 months ago
I tried to update this ticket yesterday but Progress was down and I guess I lost my comment. Here's the summary I wrote on Slack:
I think this feature is likely not worth the effort. I see no way of implementing a ref-counting mechanism for the openQA jobs waiting on a Minion job race-free with Minion notes, so we would need our own SQLite table for the ref-counting. Even then there's the problem of actually stopping the job. To do this nicely we would need to start the download asynchronously, register a SIGUSR1 handler in which we stop Mojo::IOLoop (if running) and then start the IOLoop, keeping it running until the download has finished or the signal handler stopped it. The downloader code is synchronous so this would be a huge refactoring. (Maybe the following would even work because the synchronous download uses the IOLoop under the hood anyway: https://github.com/os-autoinst/openQA/compare/master...Martchus:openQA:withdraw#diff-641e5f9ffd867a4d10f681dd233[…]b7355edc411555f7364770db420dfd3cf) And then there's still the complexity of handling SIGUSR1 in other places of the asset download task (e.g. the case where we couldn't get the lock).
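To illustrate the approach described above (purely a sketch: download_asset_p is a hypothetical asynchronous version of the downloader, $url and $target are assumed to be in scope, and how plain %SIG handlers interact with the event loop backend would still need checking):

use Mojo::IOLoop;

# register the handler first, then keep the event loop running until the
# download has finished or the signal handler stopped it
$SIG{USR1} = sub { Mojo::IOLoop->stop if Mojo::IOLoop->is_running };
my $done;
download_asset_p($url, $target)->finally(sub { $done = 1; Mojo::IOLoop->stop });    # hypothetical async downloader returning a Mojo::Promise
Mojo::IOLoop->start unless $done;
warn 'download aborted via SIGUSR1' unless $done;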
Updated by mkittler 12 months ago · Edited
About stopping: Setting an error on the request object would be the correct way to stop the synchronous download. These two test cases could be used as a reference: https://github.com/mojolicious/mojo/blob/79fd80ab549a3df79c81c6513fad05088fd2d4dc/t/mojolicious/longpolling_lite_app.t#L204-L234
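Adapted to this ticket, that could look roughly as follows (the cancellation flag and how it gets set are made up; the referenced tests raise the error on the response object, which is what this sketch does too):

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $cancelled = 0;
$SIG{USR1} = sub { $cancelled = 1 };    # hypothetical cancellation trigger
my $tx = $ua->build_tx(GET => $asset_url);    # $asset_url assumed to be in scope
# raising an error on the message makes the user agent abort the transfer
$tx->res->content->on(read => sub { $tx->res->error({message => 'Download cancelled'}) if $cancelled });
$ua->start($tx);    # blocks; returns early with $tx->res->error set when cancelled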