
action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M

Rejected

2021-08-09

openQA Infrastructure (public) - action #97409: Re-use existing filesystems on workers after reboot if possible to prevent full worker asset cache re-syncing

New

openQA Infrastructure (public) - action #97412: Reduce I/O load on OSD by using more cache size on workers with using free disk space when available instead of hardcoded space

New

action #125276: Ensure that the incomplete jobs with "cache service full" are properly restarted size:M

Resolved

2023-03-02

action #128267: Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M

Resolved

action #128276: Handle workers with busy cache service gracefully by a two-level wait size:M

Resolved

2023-04-25

Updated by okurz over 3 years ago

Target version set to Ready


Updated by okurz over 3 years ago

Category set to Feature requests
Parent task set to #64746


Updated by mkittler over 3 years ago

Just for the record, we've just seen alerts again because downloads are piling up on openqaworker6 and openqaworker5. (The alert should actually not be firing for these kinds of broken workers so I'll have a look at the alerts query.)


Updated by mkittler over 3 years ago

Subject changed from [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry to [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
Description updated (diff)


Updated by okurz over 3 years ago

Assignee deleted (~~mkittler~~)
Target version changed from Ready to future

To me this looks like less of a priority for us now that we have fixed the missing qcow compression, hence I am removing it from the backlog. Agreed?


Updated by okurz about 3 years ago

Parent task changed from #64746 to #103944


Updated by szarate about 2 years ago

So it looks like this is still happening and, from what DimStar is reporting, the retry is not working as it should: https://openqa.opensuse.org/tests/overview?result=failed&result=incomplete&result=timeout_exceeded&distri=microos&distri=opensuse&version=Tumbleweed&build=20221215&groupid=1

On top of this, some restarts of parent jobs don't restart the children properly, leaving the overview in an inconsistent state: https://suse.slack.com/archives/C02CANHLANP/p1671188124915829?thread_ts=1671188061.368929&cid=C02CANHLANP


Updated by okurz almost 2 years ago

Target version changed from future to Ready

Brought up again by DimStar today.


Updated by okurz almost 2 years ago

The "cache service queue full" was introduced with openQA commit e16bdd68a as part of https://github.com/os-autoinst/openQA/pull/4122 during #96623


#10

Updated by okurz almost 2 years ago

Description updated (diff)


#11

Updated by okurz almost 2 years ago

Ideas from estimation call:

Ensure that openQA admins are notified if workers are reporting themselves as broken
Can we bump the number for OPENQA_CACHE_MAX_INACTIVE_JOBS?
Ensure that the incomplete jobs with "cache service full" are properly restarted -> #125276
As https://openqa.opensuse.org/admin/workers shows no broken workers at the moment, we should ensure that admins are notified when workers are broken and/or that workers stay broken for longer so that people notice
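
To illustrate the idea of bumping OPENQA_CACHE_MAX_INACTIVE_JOBS, here is a rough Python sketch of the kind of queue-limit check that could lead a worker to report itself as broken. This is not the actual openQA code (which is written in Perl); the function name `cache_queue_error`, the default value, and the exact comparison are assumptions made up for this example. Only the variable name and the "Cache service queue already full" message come from this ticket.

```python
import os

# Assumed default, for illustration only; check the openQA sources for the real value.
ASSUMED_DEFAULT_MAX_INACTIVE_JOBS = 5


def cache_queue_error(inactive_jobs, env=None):
    """Return an error message if the cache queue counts as full, else None.

    Hypothetical helper: `inactive_jobs` is the number of queued (not yet
    running) cache service jobs; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    limit = int(env.get("OPENQA_CACHE_MAX_INACTIVE_JOBS",
                        ASSUMED_DEFAULT_MAX_INACTIVE_JOBS))
    if inactive_jobs > limit:
        # A worker in this situation would report itself as broken.
        return "Cache service queue already full"
    return None
```

With such a check, raising the limit via the environment variable means more downloads may pile up before a worker stops accepting jobs, which trades fewer incompletes for potentially longer setup times.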
