Project

General

Profile

Actions

action #128276

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Handle workers with busy cache service gracefully by a two-level wait size:M

Added by okurz about 1 year ago. Updated 3 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-04-25
Due date:
% Done:

0%

Estimated time:

Description

Motivation

As observed in #128267 "cache service full" jobs are still too visible to users and we should reconsider propagating the error that easily to the job level. The idea that mkittler and me had is to ensure that scheduled jobs are only assigned to workers where "cache queue size < cache_queue_limit_concurrent" (timeout is the timeout of some days when scheduled jobs get cancelled implicitly). AFAIK this is already present in the current implementation but should be cross-checked. If the condition is fulfilled and a job is assigned to a worker it can still happen that the cache queue becomes "too long" due to other jobs or the original job adding to the cache queue. This is why in this case jobs assigned to a worker should wait for other download jobs to complete before continuing. Here the timeout is MAX_SETUP_TIME. There should be the second limit "cache_queue_limit_max" (is that the current limit of 5 causing "Cache service queue already full"?) when exceeded incomplete the job as we do but with a more conservative limit, e.g. 50.

Acceptance criteria

  • AC1: During normal operation on o3 including rebooting o3 workers and "busy days" no "cache service full" incompletes happen

Acceptance tests

  • AT1-1: Given no jobs running on a worker And clean cache When a job is scheduled Then the job is immediately picked up by the worker, cache service requests are executed, job starts
  • AT1-2: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent" When a job is scheduled Then the job is not immediately picked up by the worker
  • AT1-3: Given jobs running on a worker with assigned cache service requests exceeding "cache_queue_limit_concurrent-1" When jobs are still forcefully assigned to the worker (simulating unfortunate timing aka. "race condition") Then the job assigned cache service request waits for a free slot
  • AT1-4: Given jobs running on a worker with assigned cache service requests with cache service requests When exceeding "cache_queue_limit_max" Then jobs incomplete as in before with "Cache service queue already full" (likely rephrased a bit)

Suggestions

  • Review existing implementation of the check that is done before assigning jobs to workers
  • Ensure defaults for the both limit are far enough apart, e.g. "cache_queue_limit_concurrent" = 3, "cache_queue_limit_max" = 50
  • Execute ATs to see where implementation is missing

Related issues 1 (0 open1 closed)

Copied from openQA Project - action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:MResolvedmkittler2023-05-10

Actions
Actions #1

Updated by okurz about 1 year ago

  • Copied from action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M added
Actions #2

Updated by okurz about 1 year ago

  • Priority changed from High to Low
Actions #3

Updated by livdywan about 1 year ago

  • Subject changed from Handle workers with busy cache service gracefully by a two-level wait to Handle workers with busy cache service gracefully by a two-level wait size:M
  • Status changed from New to Workable
Actions #4

Updated by okurz 8 months ago

  • Target version changed from Ready to Tools - Next
Actions #5

Updated by okurz 7 months ago

  • Target version changed from Tools - Next to Ready
Actions #6

Updated by okurz 6 months ago

  • Target version changed from Ready to Tools - Next
Actions #7

Updated by okurz 6 months ago

  • Target version changed from Tools - Next to Ready
Actions #8

Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #10

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
Actions #11

Updated by mkittler 3 months ago · Edited

The last relevant job on o3 is from before my PR was merged (it was merged around 2024-03-07 15:45 and probably deployed soon after; is definitely deployed now):

openqa=# select id, t_finished from jobs where result = 'incomplete' and reason ilike '%cache%queue%full%' order by id desc limit 10;
   id    |     t_finished      
---------+---------------------
 3994496 | 2024-03-07 11:51:11
 3994495 | 2024-03-07 11:51:10
 3994485 | 2024-03-07 11:51:15
 3994480 | 2024-03-07 11:51:15
 3993860 | 2024-03-07 07:40:54
 3993859 | 2024-03-07 07:40:54
 3993856 | 2024-03-07 07:40:55
 3993855 | 2024-03-07 07:40:54
 3993684 | 2024-03-07 07:40:16
 3993678 | 2024-03-07 07:40:17
(10 rows)

I'll re-conduct the query next week to see for sure whether my PR helped.

I'll also check whether we now have any jobs running into MAX_SETUP_TIME. So far we don't have any:

openqa=# select id, t_finished from jobs where result = 'incomplete' and reason ilike '%MAX_SETUP_TIME%' order by id desc limit 10;
 id | t_finished 
----+------------
(0 rows)
Actions #12

Updated by mkittler 3 months ago · Edited

Looks like this wasn't enough:

openqa=# select id, t_finished from jobs where result = 'incomplete' and reason ilike '%cache%queue%full%' order by id desc limit 10;
   id    |     t_finished      
---------+---------------------
 4002675 | 2024-03-10 10:10:11
 4002485 | 2024-03-10 10:10:21
 4002368 | 2024-03-10 10:09:30
 4002357 | 2024-03-10 10:09:30
 4002354 | 2024-03-10 10:09:39
 4002339 | 2024-03-10 10:09:30
 4002335 | 2024-03-10 10:09:27
 4002330 | 2024-03-10 10:09:27
 4002281 | 2024-03-10 10:09:29
 4002275 | 2024-03-10 10:09:30

Probably this doesn't work at all because slots on e.g. openqaworker-arm22 were affected and that worker doesn't even have 46 slots - so the new hard-limit should never be reached on that machine.

Actions #13

Updated by mkittler 3 months ago

Looks like this affects only arm22:

openqa=# select id, t_finished, (select host from workers where workers.id = jobs.assigned_worker_id) from jobs where result = 'incomplete' and reason ilike '%cache%queue%full%' order by id desc limit 50;
   id    |     t_finished      |        host        
---------+---------------------+--------------------
 4002675 | 2024-03-10 10:10:11 | openqaworker-arm22
 4002485 | 2024-03-10 10:10:21 | openqaworker-arm22
 4002368 | 2024-03-10 10:09:30 | openqaworker-arm22
 4002357 | 2024-03-10 10:09:30 | openqaworker-arm22
 4002354 | 2024-03-10 10:09:39 | openqaworker-arm22
 4002339 | 2024-03-10 10:09:30 | openqaworker-arm22
 4002335 | 2024-03-10 10:09:27 | openqaworker-arm22
 4002330 | 2024-03-10 10:09:27 | openqaworker-arm22
 4002281 | 2024-03-10 10:09:29 | openqaworker-arm22
 4002275 | 2024-03-10 10:09:30 | openqaworker-arm22
 4002270 | 2024-03-10 10:09:30 | openqaworker-arm22
 4002262 | 2024-03-10 10:09:27 | openqaworker-arm22
 4002261 | 2024-03-10 10:09:31 | openqaworker-arm22
 4002257 | 2024-03-10 10:09:31 | openqaworker-arm22
 4002256 | 2024-03-10 10:09:31 | openqaworker-arm22
 4002255 | 2024-03-10 10:09:31 | openqaworker-arm22
 4002253 | 2024-03-10 10:09:29 | openqaworker-arm22
 4002248 | 2024-03-10 10:09:30 | openqaworker-arm22
 4000322 | 2024-03-09 17:29:28 | openqaworker-arm22
 4000305 | 2024-03-09 17:29:24 | openqaworker-arm22
 4000304 | 2024-03-09 17:29:24 | openqaworker-arm22
 4000301 | 2024-03-09 17:29:23 | openqaworker-arm22
 4000300 | 2024-03-09 17:29:27 | openqaworker-arm22
 4000286 | 2024-03-09 17:29:27 | openqaworker-arm22
 4000280 | 2024-03-09 17:29:28 | openqaworker-arm22
 4000225 | 2024-03-09 17:29:33 | openqaworker-arm22
 4000220 | 2024-03-09 17:29:28 | openqaworker-arm22
 4000216 | 2024-03-09 17:29:24 | openqaworker-arm22
 4000211 | 2024-03-09 17:29:28 | openqaworker-arm22
 4000204 | 2024-03-09 17:29:33 | openqaworker-arm22
 4000201 | 2024-03-09 17:29:55 | ip-10-252-32-28
 …
(50 rows)

And on that machine cat /usr/share/openqa/lib/OpenQA/CacheService/Response.pm shows that still the old version of the worker package (before my changes) is installed. This is actually a nice situation for verifying whether my change made a difference because now we see the difference between before and after the change and on e.g. arm21 (where the new version has already been deployed) no problematic jobs are produced anymore.

Note that the query for MAX_SETUP_TIME still returns zero rows so I also don't think we have a regression in that regard.

Actions #14

Updated by mkittler 3 months ago

Looks like arm22 was no longer updated because it had perl-Mojo-IOLoop-ReadWriteProcess and perl-Text-Glob from devel:openQA:Leap:* installed but we now rely on Leap repos for those. So I will updated it manually (allowing the vendor change) as soon as our repositories are good again¹.

(Currently the update is prevented by File './noarch/os-autoinst-distri-opensuse-deps-1.1710149510.aaa04ac0-lp155.13942.1.noarch.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:openQA/15.5' and the error didn't go away after refreshing repos multiple times.)

Actions #15

Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

I was able to update arm22 now. So I'm considering this resolved.

Actions

Also available in: Atom PDF