action #96623

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Let workers declare themselves as broken if asset downloads are piling up size:M

Added by mkittler almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-08-06
Due date:
% Done: 0%
Estimated time:

Description

motivation / observation

This would prevent jobs from becoming incomplete due to exceeding MAX_SETUP_TIME because asset downloads take too long. We observed that this can happen in production on multiple worker hosts (see #96557) if too many asset downloads pile up at once.

acceptance criteria

  • AC1: Worker slots show up as broken within the web UI (with a meaningful message) if a configurable number of cache_asset Minion jobs on that worker host are piling up (as inactive jobs). This prevents the scheduler from assigning jobs to that worker slot.
  • AC2: Worker slots show up as online again when the number of cache_asset Minion jobs decreases. This allows the scheduler to resume assigning jobs to that worker slot.
  • AC3: Workers where the cache service is turned off are unaffected.
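
The acceptance criteria above can be sketched roughly as follows. This is only an illustration in Python, not openQA code: the names `CacheServiceInfo`, `broken_reason` and `max_inactive_jobs`, as well as the wording of the message, are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheServiceInfo:
    """Hypothetical snapshot of the cache service state on one worker host."""
    enabled: bool                    # False if the cache service is turned off
    inactive_cache_asset_jobs: int   # cache_asset Minion jobs still waiting in the queue


def broken_reason(info: CacheServiceInfo, max_inactive_jobs: int) -> Optional[str]:
    """Return a message if the worker slot should show up as broken, else None.

    max_inactive_jobs stands in for the configurable limit from AC1.
    """
    if not info.enabled:
        # AC3: workers without the cache service are unaffected
        return None
    if info.inactive_cache_asset_jobs > max_inactive_jobs:
        # AC1: too many downloads are piling up, report the slot as broken
        return (
            f"{info.inactive_cache_asset_jobs} inactive cache_asset jobs "
            f"exceed the limit of {max_inactive_jobs}"
        )
    # AC2: the queue is short enough (again), so the slot counts as online
    return None


# Example: evaluated periodically, the slot recovers once the queue shrinks
print(broken_reason(CacheServiceInfo(True, 12), max_inactive_jobs=5))
print(broken_reason(CacheServiceInfo(True, 2), max_inactive_jobs=5))
```

Since the check would be stateless and re-evaluated on every status update, AC2 needs no extra logic in this sketch: the slot is online again as soon as the count drops back to the limit or below.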

suggestions

  • We already show workers as broken when the cache service is not available at all or no Minion workers are running. Check how that's done. The implementation of this ticket would most likely be an extension of this existing mechanism.

caveats

I'm not sure whether this will prevent issue #96557, as the approach has some caveats:

  1. The scheduler can still assign many jobs to the same worker host at the same time while it has not declared itself as broken, e.g. when it has just been started.
  2. Inactive asset downloads would prevent any new jobs from being assigned, even those whose assets have already been cached.
  3. Due to 1., many inactive downloads can still pile up, and this change doesn't help with removing inactive downloads from the queue when an openQA job runs into a timeout.

Maybe it would be better to make the scheduler smarter so that it distributes jobs to worker hosts in a way that minimizes the required asset downloads?
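
A minimal sketch of what such a scheduler heuristic might look like, assuming the scheduler knew which assets each worker host has already cached (which it currently may not). The function `pick_worker_host` and the host and asset names are hypothetical and only serve as illustration.

```python
from typing import Dict, Set


def pick_worker_host(job_assets: Set[str], cached_assets_by_host: Dict[str, Set[str]]) -> str:
    """Pick the host that would need to download the fewest assets for a job.

    Hosts are visited in sorted order, so ties go to the alphabetically
    first host name.
    """
    return min(
        sorted(cached_assets_by_host),
        key=lambda host: len(job_assets - cached_assets_by_host[host]),
    )


# Hypothetical cache contents per worker host
hosts = {
    "worker-a": {"a.iso"},
    "worker-b": {"a.iso", "b.qcow2"},
}
# worker-b already has both assets, so nothing needs to be downloaded there
print(pick_worker_host({"a.iso", "b.qcow2"}, hosts))
```

A real implementation would also have to weigh this against the load on each host; the sketch only covers the download-minimizing part of the idea.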


Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry (Resolved, mkittler, 2021-08-04, 2021-08-19)
