action #96623: Let workers declare themselves as broken if asset downloads are piling up size:M


closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Let workers declare themselves as broken if asset downloads are piling up size:M

Added by mkittler almost 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: dheidler
Category: Feature requests
Target version: Ready
Start date: 2021-08-06
Due date:
% Done:
Estimated time:

Description

motivation / observation

This would prevent jobs from becoming incomplete by exceeding MAX_SETUP_TIME when asset downloads take too long. We observed this in production on multiple worker hosts (see #96557) when too many asset downloads pile up at once.

acceptance criteria

AC1: Worker slots show up as broken within the web UI (with a meaningful message) if a configurable number of cache_asset Minion jobs on that worker host are piling up (as inactive jobs). This prevents the scheduler from assigning jobs to that worker slot.
AC2: Worker slots show up as online again when the number of cache_asset Minion jobs decreases. This allows the scheduler to resume assigning jobs to that worker slot.
AC3: Workers where the cache service is turned off are unaffected.
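
To make the criteria concrete, here is a minimal sketch of the check a worker slot could run before reporting its status. It is not openQA's actual implementation: `CacheServiceInfo`, `broken_reason`, the `limit` parameter and the message wording are all hypothetical names chosen for illustration, and treating "piling up" as "strictly more than the limit" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheServiceInfo:
    """Hypothetical snapshot of the cache service state on one worker host."""
    enabled: bool                    # False if the cache service is turned off
    inactive_cache_asset_jobs: int   # queued cache_asset Minion jobs not yet started


def broken_reason(info: CacheServiceInfo, limit: int) -> Optional[str]:
    """Return a message if the slot should report itself as broken, else None."""
    if not info.enabled:
        # AC3: workers without the cache service are never affected
        return None
    if info.inactive_cache_asset_jobs > limit:
        # AC1: too many downloads are queued, so stop accepting jobs
        # (assumption: the limit itself is still considered acceptable)
        return (
            "too many inactive cache_asset jobs: "
            f"{info.inactive_cache_asset_jobs} (limit: {limit})"
        )
    # AC2: the queue has drained again, so the slot is usable
    return None
```

Because the function is evaluated from the current queue length every time, a slot would recover on its own once the number of inactive jobs drops back to the limit or below, which is what AC2 asks for.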

suggestions

We already show workers as broken when the cache service is not available at all or no Minion workers are running. Check how that's done. The implementation of this ticket would most likely be an extension of this existing mechanism.

caveats

I'm not sure whether this will prevent issue #96557, as the approach has some caveats:

1. The scheduler can still assign many jobs to the same worker host at the same time while it has not yet declared itself as broken, e.g. when it has just been started.
2. Inactive asset downloads would prevent any new jobs from being assigned, even those whose assets are already cached.
3. Due to 1., many inactive downloads can still pile up, and this change doesn't help with removing inactive downloads from the queue when an openQA job runs into a timeout.

Maybe it would be better to make the scheduler smarter, so that it distributes jobs to worker hosts in a way that minimizes the required asset downloads?
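
As a rough illustration of that alternative, the sketch below picks the worker host that would need the fewest additional downloads for a job. It assumes the scheduler knows which assets each host has cached, which the ticket does not state; `pick_host`, the host names and the asset names are made up for the example.

```python
from typing import Dict, Set


def pick_host(job_assets: Set[str], cached_by_host: Dict[str, Set[str]]) -> str:
    """Return the host that has to download the fewest assets for this job.

    Ties are broken by host name so that the result is deterministic.
    """
    return min(
        cached_by_host,
        key=lambda host: (len(job_assets - cached_by_host[host]), host),
    )


# Hypothetical cache contents per worker host
hosts = {
    "worker1": {"a.iso"},
    "worker2": {"a.iso", "b.qcow2"},
}
best = pick_host({"a.iso", "b.qcow2"}, hosts)  # worker2 needs no download
```

A real scheduler would also have to weigh host load and asset sizes; counting missing assets is only the simplest possible cost function.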

Related issues 1 (0 open — 1 closed)


Updated by tinita over 3 years ago

Updated by livdywan over 3 years ago

Updated by livdywan over 3 years ago

Updated by mkittler over 3 years ago

Updated by dheidler over 3 years ago

Updated by openqa_review over 3 years ago

Updated by dheidler over 3 years ago

Updated by dheidler over 3 years ago

Updated by dheidler over 3 years ago

Updated by mkittler over 3 years ago

Updated by livdywan over 3 years ago