coordination #98463

open

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

[epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Added by mkittler about 3 years ago. Updated 4 months ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-08-06
Due date:
% Done: 71%
Estimated time: (Total: 0.00 h)

Description

problem and scope

This epic is about the general problem that asset downloads can be quite slow, leading to jobs exceeding MAX_SETUP_TIME or ending up incomplete with "Cache service queue already full". It is not about worker-host-specific problems, e.g. a broken filesystem or networking problems.

ideas to improve

There are multiple factors contributing to the problem so there's not one simple fix. Here is a list of the areas where we have room for improvement (feel free to add more items):

  1. The file system on OSD workers is re-created on every reboot, so the cache needs to be completely renewed on every reboot. Hence this problem shows up almost exclusively on OSD (and not on o3).
  2. We would also benefit from using a bigger asset cache (although without 1. being addressed it is likely not of much use).
  3. We should avoid processing downloads when their jobs have exceeded the timeout anyway. This of course only improves the handling of the symptom and might not be very useful anymore once the problem itself is fixed.
  4. We could try to tweak the parameter OPENQA_CACHE_MAX_INACTIVE_JOBS.
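
A minimal sketch of ideas 3 and 4, in Python purely for illustration (openQA itself is written in Perl). All names below (QueuedDownload, drop_expired, queue_is_full, setup_deadline) are hypothetical and not taken from the openQA code base. Whether the real limit check uses a strict comparison is an assumption as well.

```python
from dataclasses import dataclass


@dataclass
class QueuedDownload:
    """Hypothetical stand-in for one pending asset download."""
    job_id: int
    asset: str
    setup_deadline: float  # time at which the job would exceed MAX_SETUP_TIME


def drop_expired(queue, now):
    """Idea 3: keep only downloads whose job can still finish setup in time."""
    return [d for d in queue if d.setup_deadline > now]


def queue_is_full(inactive_jobs, max_inactive_jobs):
    """Idea 4: the limit decides when the queue counts as 'already full'.

    Assumption: a strict comparison; the actual check may differ.
    """
    return inactive_jobs > max_inactive_jobs


queue = [
    QueuedDownload(1, "a.qcow2", setup_deadline=100.0),
    QueuedDownload(2, "b.iso", setup_deadline=500.0),
]
# At time 200 the first job has already timed out, so its download is skipped.
remaining = drop_expired(queue, now=200.0)
```

Raising the limit passed as max_inactive_jobs lets more downloads queue up before the worker is considered full. Dropping expired entries keeps the queue from being clogged by downloads nobody is waiting for anymore.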

acceptance criteria

  • AC1: The figures for jobs exceeding MAX_SETUP_TIME are significantly lower than the ones mentioned under "further details" below. A specific worker host causing problems for reasons specific to that machine is out of scope, though.

further details

Multiple worker hosts are affected:

openqa=> select host, count(id) as online_slots,
           (select array[count(distinct id),
                         count(distinct id) / (extract(epoch FROM (timezone('UTC', now()) - '2021-09-07T00:00:00')) / 3600)]
              from jobs join jobs_assets on jobs.id = jobs_assets.job_id
             where assigned_worker_id = any(array_agg(w.id))
               and t_finished >= '2021-09-07T00:00:00'
               and reason like '%setup exceeded MAX_SETUP_TIME%') as recently_abandoned_jobs_total_and_per_hour
           from workers as w
          where t_updated > (timezone('UTC', now()) - interval '1 hour')
          group by host
          order by recently_abandoned_jobs_total_and_per_hour desc;
        host         | online_slots | recently_abandoned_jobs_total_and_per_hour 
---------------------+--------------+--------------------------------------------
 openqaworker5       |           41 | {14,0.167352897235061}
 openqaworker6       |           29 | {12,0.143445340487195}
 openqaworker13      |           16 | {9,0.107584005365396}
 openqaworker3       |           19 | {5,0.0597688918696647}
 openqaworker8       |           16 | {5,0.0597688918696647}
 openqaworker9       |           16 | {5,0.0597688918696647}
 QA-Power8-5-kvm     |            8 | {3,0.0358613351217988}
 openqaworker11      |           10 | {0,0}
 openqaworker2       |           34 | {0,0}
 QA-Power8-4-kvm     |            8 | {0,0}
 powerqaworker-qam-1 |            8 | {0,0}
 automotive-3        |            1 | {0,0}
 grenache-1          |           50 | {0,0}
 malbec              |            4 | {0,0}
 openqaworker-arm-1  |           10 | {0,0}
 openqaworker-arm-2  |           20 | {0,0}
 openqaworker10      |           10 | {0,0}
(17 rows)
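
For reference, the second array element is simply the job count divided by the hours elapsed since the cut-off date 2021-09-07T00:00:00 UTC. The sketch below mirrors that arithmetic in Python. The function name abandoned_per_hour and the query time used in the example are assumptions, since the ticket does not record when the query was run.

```python
from datetime import datetime, timezone


def abandoned_per_hour(count, since, now):
    """Divide the job count by the hours elapsed since `since`, as the query does."""
    hours = (now - since).total_seconds() / 3600
    return count / hours


since = datetime(2021, 9, 7, tzinfo=timezone.utc)
# Hypothetical query time, exactly 84 hours after the cut-off.
now = datetime(2021, 9, 10, 12, tzinfo=timezone.utc)
rate = abandoned_per_hour(14, since, now)

# Going the other way: the rate reported for openqaworker5 implies the
# length of the observation window at the time the query was run.
window_hours = 14 / 0.167352897235061
```

With the figures from the table, the implied window is roughly 83.7 hours, i.e. about three and a half days of data.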

The ones which are affected most are also the ones needing the most assets:

openqa=> select host, count(id) as online_slots,
           (select array[((select sum(size) from assets where id = any(array_agg(distinct jobs_assets.asset_id))) / 1024 / 1024 / 1024),
                         count(distinct id)]
              from jobs join jobs_assets on jobs.id = jobs_assets.job_id
             where assigned_worker_id = any(array_agg(w.id))
               and t_finished >= '2021-09-07T00:00:00') as recent_asset_size_in_gb_and_job_count
           from workers as w
          where t_updated > (timezone('UTC', now()) - interval '1 hour')
          group by host
          order by recent_asset_size_in_gb_and_job_count desc;
        host         | online_slots | recent_asset_size_in_gb_and_job_count 
---------------------+--------------+---------------------------------------
 openqaworker11      |           10 | {NULL,0}
 automotive-3        |            1 | {NULL,0}
 openqaworker6       |           29 | {1739.5315849324688340,3444}
 openqaworker5       |           41 | {1668.8964441129937744,3665}
 openqaworker13      |           16 | {1591.4191119810566328,2221}
 openqaworker8       |           16 | {1487.1783863399177842,2531}
 openqaworker3       |           19 | {1447.2926171422004697,2350}
 openqaworker9       |           16 | {1368.1286235852167031,2380}
 openqaworker10      |           10 | {1117.2662402801215645,1706}
 openqaworker2       |           34 | {781.0186277972534277,718}
 grenache-1          |           50 | {663.5168796060606865,1477}
 openqaworker-arm-2  |           20 | {346.2731295535340879,1123}
 openqaworker-arm-1  |           10 | {332.1729393638670449,614}
 QA-Power8-5-kvm     |            8 | {239.5352552458643916,298}
 powerqaworker-qam-1 |            8 | {238.9669120963662910,361}
 QA-Power8-4-kvm     |            8 | {223.1794419540092373,297}
 malbec              |            4 | {187.9319233968853955,141}
(17 rows)
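
As a quick cross-check of that claim, the figures from both result tables can be put side by side. The snippet below is only a sketch: the values are copied by hand from the tables, asset sizes are rounded to whole GB, the two hosts with a NULL size are omitted, and both the threshold of 5 abandoned jobs and the "top six" cut are arbitrary choices made for illustration.

```python
# Abandoned jobs per host (hosts not listed here had 0).
abandoned = {
    "openqaworker5": 14, "openqaworker6": 12, "openqaworker13": 9,
    "openqaworker3": 5, "openqaworker8": 5, "openqaworker9": 5,
    "QA-Power8-5-kvm": 3,
}

# Recent asset size per host in GB, rounded.
asset_gb = {
    "openqaworker6": 1740, "openqaworker5": 1669, "openqaworker13": 1591,
    "openqaworker8": 1487, "openqaworker3": 1447, "openqaworker9": 1368,
    "openqaworker10": 1117, "openqaworker2": 781, "grenache-1": 664,
    "openqaworker-arm-2": 346, "openqaworker-arm-1": 332,
    "QA-Power8-5-kvm": 240, "powerqaworker-qam-1": 239,
    "QA-Power8-4-kvm": 223, "malbec": 188,
}

# The six hosts that needed the most asset data ...
by_size = sorted(asset_gb, key=asset_gb.get, reverse=True)
top_six = set(by_size[:6])

# ... compared with the hosts that abandoned at least 5 jobs.
most_affected = {host for host, n in abandoned.items() if n >= 5}

print(sorted(top_six) == sorted(most_affected))
```

The two sets are identical. QA-Power8-5-kvm is the one outlier: it abandoned 3 jobs despite a comparatively small asset volume, while openqaworker10 abandoned none with more than 1 TB.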

Subtasks 7 (2 open, 5 closed)

  • action #96623: Let workers declare themselves as broken if asset downloads are piling up size:M | Resolved | dheidler | 2021-08-06
  • action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M | Rejected | mkittler | 2021-08-09
  • openQA Infrastructure - action #97409: Re-use existing filesystems on workers after reboot if possible to prevent full worker asset cache re-syncing | New
  • openQA Infrastructure - action #97412: Reduce I/O load on OSD by using more cache size on workers with using free disk space when available instead of hardcoded space | New
  • action #125276: Ensure that the incomplete jobs with "cache service full" are properly restarted size:M | Resolved | mkittler | 2023-03-02
  • action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M | Resolved | mkittler
  • action #128276: Handle workers with busy cache service gracefully by a two-level wait size:M | Resolved | mkittler | 2023-04-25
Actions #1

Updated by okurz about 3 years ago

  • Target version set to Ready
Actions #2

Updated by okurz about 3 years ago

  • Category set to Feature requests
  • Parent task set to #64746
Actions #3

Updated by mkittler about 3 years ago

Just for the record, we've just seen alerts again because downloads are piling up on openqaworker6 and openqaworker5. (The alert should actually not be firing for these kinds of broken workers so I'll have a look at the alerts query.)

Actions #4

Updated by mkittler about 3 years ago

  • Subject changed from [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry to [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
  • Description updated (diff)
Actions #5

Updated by okurz about 3 years ago

  • Assignee deleted (mkittler)
  • Target version changed from Ready to future

To me this looks less of a priority for us after we fixed the missing qcow compression, hence removing from backlog. Agreed?

Actions #6

Updated by okurz almost 3 years ago

  • Parent task changed from #64746 to #103944
Actions #7

Updated by szarate almost 2 years ago

So it looks like this is still happening, and from what DimStar is reporting... the retry is not doing the work as it should: https://openqa.opensuse.org/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=microos&distri=opensuse&version=Tumbleweed&build=20221215&groupid=1

On top of this, some restarts of parent jobs don't restart the children properly, leaving the overview in an inconsistent state: https://suse.slack.com/archives/C02CANHLANP/p1671188124915829?thread_ts=1671188061.368929&cid=C02CANHLANP

Actions #8

Updated by okurz over 1 year ago

  • Target version changed from future to Ready

Brought up by DimStar today again

Actions #9

Updated by okurz over 1 year ago

The "cache service queue full" was introduced with openQA commit e16bdd68a as part of https://github.com/os-autoinst/openQA/pull/4122 during #96623

Actions #10

Updated by okurz over 1 year ago

  • Description updated (diff)
Actions #11

Updated by okurz over 1 year ago

Ideas from estimation call:

  1. Ensure that openQA admins are notified if workers are reporting themselves as broken
  2. Can we bump the number for OPENQA_CACHE_MAX_INACTIVE_JOBS?
  3. Ensure that the incomplete jobs with "cache service full" are properly restarted -> #125276
  4. As https://openqa.opensuse.org/admin/workers shows no broken workers at the moment, we should ensure that admins are notified when workers are broken and/or that workers stay broken long enough for people to notice
Actions #12

Updated by okurz over 1 year ago

  • Status changed from New to Blocked
  • Assignee set to okurz

We are looking into #125276 first

Actions #13

Updated by okurz over 1 year ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

#125276 completed, work can be continued here

Actions #14

Updated by mkittler over 1 year ago

  • Tracker changed from action to coordination
  • Status changed from New to Blocked
  • Assignee set to mkittler

Blocked by #128276 or #96684

Actions #15

Updated by livdywan about 1 year ago

  • Assignee changed from mkittler to kraih
Actions #16

Updated by okurz about 1 year ago

  • Target version changed from Ready to Tools - Next