action #128276
closedcoordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance
coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
Handle workers with busy cache service gracefully by a two-level wait size:M
Description
Motivation¶
As observed in #128267 "cache service full" jobs are still too visible to users and we should reconsider propagating the error that easily to the job level. The idea that mkittler and me had is to ensure that scheduled jobs are only assigned to workers where "cache queue size < cache_queue_limit_concurrent" (timeout is the timeout of some days when scheduled jobs get cancelled implicitly). AFAIK this is already present in the current implementation but should be cross-checked. If the condition is fulfilled and a job is assigned to a worker it can still happen that the cache queue becomes "too long" due to other jobs or the original job adding to the cache queue. This is why in this case jobs assigned to a worker should wait for other download jobs to complete before continuing. Here the timeout is MAX_SETUP_TIME. There should be the second limit "cache_queue_limit_max" (is that the current limit of 5 causing "Cache service queue already full"?) when exceeded incomplete the job as we do but with a more conservative limit, e.g. 50.
Acceptance criteria¶
- AC1: During normal operation on o3 including rebooting o3 workers and "busy days" no "cache service full" incompletes happen
Acceptance tests¶
- AT1-1: Given no jobs running on a worker And clean cache When a job is scheduled Then the job is immediately picked up by the worker, cache service requests are executed, job starts
- AT1-2: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent" When a job is scheduled Then the job is not immediately picked up by the worker
- AT1-3: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent-1" When jobs are still forcefully assigned to the worker (simulating unfortunate timing aka. "race condition") Then the job assigned cache service request waits for a free slot
- AT1-4: Given jobs running on a worker with assigned cache service requests with cache service requests When exceeding "cache_queue_limit_max" Then jobs incomplete as in before with "Cache service queue already full" (likely rephrased a bit)
Suggestions¶
- Review existing implementation of the check that is done before assigning jobs to workers
- Ensure defaults for the both limit are far enough apart, e.g. "cache_queue_limit_concurrent" = 3, "cache_queue_limit_max" = 50
- Execute ATs to see where implementation is missing
Updated by okurz over 1 year ago
- Copied from action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M added
Updated by livdywan over 1 year ago
- Subject changed from Handle workers with busy cache service gracefully by a two-level wait to Handle workers with busy cache service gracefully by a two-level wait size:M
- Status changed from New to Workable
Updated by okurz about 1 year ago
- Target version changed from Ready to Tools - Next
Updated by mkittler 7 months ago · Edited
The last relevant job on o3 is from before my PR was merged (it was merged around 2024-03-07 15:45 and probably deployed soon after; is definitely deployed now):
openqa=# select id, t_finished from jobs where result = 'incomplete' and reason ilike '%cache%queue%full%' order by id desc limit 10;
id | t_finished
---------+---------------------
3994496 | 2024-03-07 11:51:11
3994495 | 2024-03-07 11:51:10
3994485 | 2024-03-07 11:51:15
3994480 | 2024-03-07 11:51:15
3993860 | 2024-03-07 07:40:54
3993859 | 2024-03-07 07:40:54
3993856 | 2024-03-07 07:40:55
3993855 | 2024-03-07 07:40:54
3993684 | 2024-03-07 07:40:16
3993678 | 2024-03-07 07:40:17
(10 rows)
I'll re-conduct the query next week to see for sure whether my PR helped.
I'll also check whether we now have any jobs running into MAX_SETUP_TIME
. So far we don't have any:
openqa=# select id, t_finished from jobs where result = 'incomplete' and reason ilike '%MAX_SETUP_TIME%' order by id desc limit 10;
id | t_finished
----+------------
(0 rows)
Updated by mkittler 7 months ago · Edited
Looks like this wasn't enough:
openqa=# select id, t_finished from jobs where result = 'incomplete' and reason ilike '%cache%queue%full%' order by id desc limit 10;
id | t_finished
---------+---------------------
4002675 | 2024-03-10 10:10:11
4002485 | 2024-03-10 10:10:21
4002368 | 2024-03-10 10:09:30
4002357 | 2024-03-10 10:09:30
4002354 | 2024-03-10 10:09:39
4002339 | 2024-03-10 10:09:30
4002335 | 2024-03-10 10:09:27
4002330 | 2024-03-10 10:09:27
4002281 | 2024-03-10 10:09:29
4002275 | 2024-03-10 10:09:30
Probably this doesn't work at all because slots on e.g. openqaworker-arm22
were affected and that worker doesn't even have 46 slots - so the new hard-limit should never be reached on that machine.
Updated by mkittler 7 months ago
Looks like this affects only arm22:
openqa=# select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id) from jobs where result = 'incomplete' and reason ilike '%cache%queue%full%' order by id desc limit 50;
id | t_finished | host
---------+---------------------+--------------------
4002675 | 2024-03-10 10:10:11 | openqaworker-arm22
4002485 | 2024-03-10 10:10:21 | openqaworker-arm22
4002368 | 2024-03-10 10:09:30 | openqaworker-arm22
4002357 | 2024-03-10 10:09:30 | openqaworker-arm22
4002354 | 2024-03-10 10:09:39 | openqaworker-arm22
4002339 | 2024-03-10 10:09:30 | openqaworker-arm22
4002335 | 2024-03-10 10:09:27 | openqaworker-arm22
4002330 | 2024-03-10 10:09:27 | openqaworker-arm22
4002281 | 2024-03-10 10:09:29 | openqaworker-arm22
4002275 | 2024-03-10 10:09:30 | openqaworker-arm22
4002270 | 2024-03-10 10:09:30 | openqaworker-arm22
4002262 | 2024-03-10 10:09:27 | openqaworker-arm22
4002261 | 2024-03-10 10:09:31 | openqaworker-arm22
4002257 | 2024-03-10 10:09:31 | openqaworker-arm22
4002256 | 2024-03-10 10:09:31 | openqaworker-arm22
4002255 | 2024-03-10 10:09:31 | openqaworker-arm22
4002253 | 2024-03-10 10:09:29 | openqaworker-arm22
4002248 | 2024-03-10 10:09:30 | openqaworker-arm22
4000322 | 2024-03-09 17:29:28 | openqaworker-arm22
4000305 | 2024-03-09 17:29:24 | openqaworker-arm22
4000304 | 2024-03-09 17:29:24 | openqaworker-arm22
4000301 | 2024-03-09 17:29:23 | openqaworker-arm22
4000300 | 2024-03-09 17:29:27 | openqaworker-arm22
4000286 | 2024-03-09 17:29:27 | openqaworker-arm22
4000280 | 2024-03-09 17:29:28 | openqaworker-arm22
4000225 | 2024-03-09 17:29:33 | openqaworker-arm22
4000220 | 2024-03-09 17:29:28 | openqaworker-arm22
4000216 | 2024-03-09 17:29:24 | openqaworker-arm22
4000211 | 2024-03-09 17:29:28 | openqaworker-arm22
4000204 | 2024-03-09 17:29:33 | openqaworker-arm22
4000201 | 2024-03-09 17:29:55 | ip-10-252-32-28
…
(50 rows)
And on that machine cat /usr/share/openqa/lib/OpenQA/CacheService/Response.pm
shows that still the old version of the worker package (before my changes) is installed. This is actually a nice situation for verifying whether my change made a difference because now we see the difference between before and after the change and on e.g. arm21 (where the new version has already been deployed) no problematic jobs are produced anymore.
Note that the query for MAX_SETUP_TIME
still returns zero rows so I also don't think we have a regression in that regard.
Updated by mkittler 7 months ago
Looks like arm22 was no longer updated because it had perl-Mojo-IOLoop-ReadWriteProcess and perl-Text-Glob from devel:openQA:Leap:*
installed but we now rely on Leap repos for those. So I will updated it manually (allowing the vendor change) as soon as our repositories are good again¹.
(Currently the update is prevented by File './noarch/os-autoinst-distri-opensuse-deps-1.1710149510.aaa04ac0-lp155.13942.1.noarch.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:openQA/15.5'
and the error didn't go away after refreshing repos multiple times.)