action #96623

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Let workers declare themselves as broken if asset downloads are piling up size:M

Added by mkittler over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2021-08-06
Due date:
% Done: 0%
Estimated time:

Description

motivation / observation

This would prevent jobs from becoming incomplete due to exceeding MAX_SETUP_TIME because asset downloads take too long. We observed that this can happen in production on multiple worker hosts (see #96557) if too many asset downloads pile up at once.

acceptance criteria

  • AC1: Worker slots show up as broken within the web UI (with a meaningful message) if a configurable number of cache_asset Minion jobs on that worker host are piling up (as inactive jobs). This prevents the scheduler from assigning jobs to that worker slot.
  • AC2: Worker slots show up as online again when the number of cache_asset Minion jobs decreases. This allows the scheduler to resume assigning jobs to that worker slot.
  • AC3: Workers where the cache service is turned off are unaffected.
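
The three criteria above can be sketched as a single decision function. This is only an illustration in Python, not openQA's actual implementation; the function name `check_cache_backlog`, its parameters, the default limit of 5, and the choice to treat only counts strictly above the limit as "piling up" are all assumptions made for the example. The message text borrows the "Cache service queue already full" wording from the parent epic's title.

```python
from typing import Optional


def check_cache_backlog(cache_enabled: bool, inactive_jobs: int, limit: int = 5) -> Optional[str]:
    """Return an error message if the worker should report itself as broken, else None.

    All names and the default limit are hypothetical, chosen for illustration only.
    """
    # AC3: workers where the cache service is turned off are unaffected.
    if not cache_enabled:
        return None
    # AC1: too many inactive cache_asset jobs, so report the slot as broken
    # with a meaningful message.
    if inactive_jobs > limit:
        return f"Cache service queue already full ({inactive_jobs} inactive jobs, limit is {limit})"
    # AC2: the backlog is at or below the limit, so the slot counts as online (again).
    return None
```

Because the check keeps no state, a worker that re-evaluates it periodically would switch back to online as soon as the backlog shrinks, which is what AC2 asks for.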

suggestions

  • We already show workers as broken when the cache service is not available at all or no Minion workers are running. Check how that's done. The implementation of this ticket would most likely be an extension of this existing mechanism.

caveats

I'm not sure whether this will prevent the issue #96557 as it has some caveats:

  1. The scheduler can still assign many jobs to the same worker host at the same time while it has not declared itself as broken, e.g. when it has just been started.
  2. Inactive asset downloads would prevent any new jobs from being assigned, even those whose assets have already been cached.
  3. Due to 1., many inactive downloads can still pile up, and this change doesn't help with removing inactive downloads from the queue when an openQA job runs into a timeout.

Maybe it would be better to make the scheduler smarter so it distributes jobs to worker hosts in a way to minimize the required asset downloads?
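
One way to picture that alternative: for each job, prefer the worker host whose cache is missing the fewest of the job's assets. The Python sketch below is purely illustrative and assumes the scheduler could know which assets each host has cached, which this ticket does not establish; `pick_host`, `job_assets`, and `cached_by_host` are hypothetical names.

```python
from typing import Dict, Iterable, Set


def pick_host(job_assets: Iterable[str], cached_by_host: Dict[str, Set[str]]) -> str:
    """Pick the host that would need to download the fewest assets for this job.

    Hypothetical helper for illustration; raises ValueError if no hosts are given.
    """
    needed = set(job_assets)
    # Count the assets missing from each host's cache; ties are broken by
    # host name so that the result is deterministic.
    return min(cached_by_host, key=lambda host: (len(needed - cached_by_host[host]), host))
```

For example, with `{"worker1": {"a.iso"}, "worker2": {"a.iso", "b.qcow2"}}`, a job needing `a.iso` and `b.qcow2` would go to `worker2`, which needs no downloads at all. A real scheduler would also have to weigh load and worker class, which this sketch ignores.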


Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry - Resolved - mkittler - 2021-08-04 - 2021-08-19

Actions #1

Updated by tinita over 3 years ago

  • Target version set to Ready
Actions #2

Updated by livdywan over 3 years ago

  • Related to action #96557: jobs run into MAX_SETUP_TIME, one hour between 'Downloading' and 'Download processed' and no useful output in between auto_review:"timeout: setup exceeded MAX_SETUP_TIME":retry added
Actions #3

Updated by livdywan over 3 years ago

  • Subject changed from Let workers declare themselves as broken if asset downloads are piling up to Let workers declare themselves as broken if asset downloads are piling up size:M
  • Status changed from New to Workable
Actions #4

Updated by mkittler over 3 years ago

  • Description updated (diff)
Actions #5

Updated by dheidler over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #6

Updated by openqa_review over 3 years ago

  • Due date set to 2021-08-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by dheidler over 3 years ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by dheidler over 3 years ago

  • Status changed from Feedback to In Progress
Actions #9

Updated by dheidler about 3 years ago

  • Status changed from In Progress to Resolved
Actions #10

Updated by mkittler about 3 years ago

  • Parent task set to #98463
Actions #11

Updated by livdywan about 3 years ago

  • Due date deleted (2021-08-26)