action #96623
closedcoordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance
coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
Let workers declare themselves as broken if asset downloads are piling up size:M
Description
motivation / observation
This would prevent jobs from becoming incomplete because asset downloads take too long and the job exceeds MAX_SETUP_TIME. We observed that this can happen in production on multiple worker hosts (see #96557) if too many asset downloads pile up at once.
acceptance criteria
- AC1: Worker slots show up as broken within the web UI (with a meaningful message) if a configurable number of `cache_asset` Minion jobs on that worker host are piling up (as inactive jobs). This prevents the scheduler from assigning jobs to that worker slot.
- AC2: Worker slots show up as online again when the number of `cache_asset` Minion jobs decreases. This allows the scheduler to resume assigning jobs to that worker slot.
- AC3: Workers where the cache service is turned off are unaffected.
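The acceptance criteria above boil down to a threshold check on the number of inactive `cache_asset` jobs. The following is only a minimal sketch of that decision in Python, not the actual openQA worker code. `CacheServiceStatus`, `broken_reason`, the `limit` parameter and the message text are hypothetical names chosen for illustration, and treating "reached the limit" (`>=`) as broken is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheServiceStatus:
    """Hypothetical snapshot of the cache service on one worker host."""
    enabled: bool                    # False if the cache service is turned off
    inactive_cache_asset_jobs: int   # queued cache_asset Minion jobs not yet started


def broken_reason(status: CacheServiceStatus, limit: int) -> Optional[str]:
    """Return a message if the worker slot should show up as broken, else None."""
    if not status.enabled:
        # AC3: workers without the cache service are unaffected
        return None
    if status.inactive_cache_asset_jobs >= limit:
        # AC1: too many downloads are piling up, report the slot as broken
        return (
            f"{status.inactive_cache_asset_jobs} inactive cache_asset jobs "
            f"(limit: {limit})"
        )
    # AC2: below the limit the slot is (again) reported as online
    return None
```

Because the check is stateless, re-evaluating it on every status update would cover AC2 without extra bookkeeping: once the queue drains below the limit, the function returns `None` and the slot can be reported as online again.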
suggestions
- We already show workers as broken when the cache service is not available at all or no Minion workers are running. Check how that's done. The implementation of this ticket would most likely be an extension of this existing mechanism.
caveats
I'm not sure whether this will prevent the issue #96557 as it has some caveats:
1. The scheduler can still assign many jobs to the same worker host at the same time while it has not declared itself as broken, e.g. when it has just been started.
2. Inactive asset downloads would prevent any new jobs from being assigned, even those whose assets have already been cached.
3. Because of caveat 1, many inactive downloads can still pile up, and this change doesn't help with removing inactive downloads from the queue when an openQA job runs into a timeout.
Maybe it would be better to make the scheduler smarter so it distributes jobs to worker hosts in a way to minimize the required asset downloads?
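
To make the alternative idea more concrete, here is a rough Python sketch of a selection rule that prefers the worker host needing the fewest new asset downloads. It is an illustration only and does not reflect how the openQA scheduler works. `pick_host` and the `cached_by_host` mapping are hypothetical, and the sketch assumes the scheduler would know which assets each host has cached.

```python
from typing import Dict, Set


def pick_host(job_assets: Set[str], cached_by_host: Dict[str, Set[str]]) -> str:
    """Pick the host that would have to download the fewest assets for a job.

    Ties are broken by host name so the result is deterministic.
    """
    if not cached_by_host:
        raise ValueError("no worker hosts available")
    return min(
        sorted(cached_by_host),
        # number of assets the job needs that are missing from the host's cache
        key=lambda host: len(job_assets - cached_by_host[host]),
    )


# Example: worker2 already has both assets cached, so it is preferred.
hosts = {
    "worker1": {"a.iso"},
    "worker2": {"a.iso", "b.qcow2"},
    "worker3": set(),
}
chosen = pick_host({"a.iso", "b.qcow2"}, hosts)
```

A real scheduler would have to weigh this against other factors such as current load and free worker slots, which the sketch ignores.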