action #96623

Updated by mkittler almost 3 years ago

### motivation / observation 
 This would prevent jobs from becoming incomplete by exceeding `MAX_SETUP_TIME` because asset downloads take too long. We observed that this can happen in production on multiple worker hosts (see #96557) when too many asset downloads pile up at once. 

 ### acceptance criteria 
 * **AC1:** Worker slots show up as broken within the web UI (with a meaningful message) if a configurable number of `cache_asset` Minion jobs on that worker host are piling up (as inactive jobs). This prevents the scheduler from assigning jobs to that worker slot. 
 * **AC2:** Worker slots show up as online again when the number of `cache_asset` Minion jobs decreases. This allows the scheduler to resume assigning jobs to that worker slot. 
 * **AC3:** Workers where the cache service is turned off are unaffected. 
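
The decision that AC1 to AC3 describe can be sketched as a single check. This is only an illustration in Python, not openQA's actual code or API: the function name, the `limit` parameter and the message text are hypothetical, and it assumes the worker can obtain the number of inactive `cache_asset` Minion jobs from its cache service.

```python
from typing import Optional


def broken_reason(cache_enabled: bool, inactive_cache_asset_jobs: int, limit: int) -> Optional[str]:
    """Return a message if the worker slot should report itself as broken, else None.

    Hypothetical helper for illustration; none of these names exist in openQA.
    """
    # AC3: workers with the cache service turned off are unaffected.
    if not cache_enabled:
        return None
    # AC1: too many queued downloads, so report the slot as broken with a meaningful message.
    if inactive_cache_asset_jobs > limit:
        return (
            f"Too many inactive cache_asset jobs: "
            f"{inactive_cache_asset_jobs} queued, limit is {limit}"
        )
    # AC2: the backlog is at or below the limit, so the slot counts as online again.
    return None
```

Evaluating the check every time the worker reports its status would cover both directions: the slot turns broken when the backlog exceeds the limit and recovers once the backlog shrinks.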

 ### suggestions 
 * We already show workers as broken when the cache service is not available at all or no Minion workers are running. Check how that's done. The implementation of this ticket would most likely be an extension of this existing mechanism. 

 ### caveats 
 I'm not sure whether this will prevent issue #96557, as the approach has some caveats: 

 1. The scheduler can still assign many jobs to the same worker host at the same time as long as that host has not yet declared itself broken, e.g. when it has just been started. 
 2. Inactive asset downloads would prevent any new jobs from being assigned, even those whose assets have already been cached. 
 3. Because of caveat 1, many inactive downloads can still pile up, and this change doesn't help with removing inactive downloads from the queue when an openQA job runs into a timeout. 

 Maybe it would be better to make the scheduler smarter so that it distributes jobs to worker hosts in a way that minimizes the required asset downloads? 
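
One way to picture such a scheduler heuristic is the sketch below. It is purely illustrative and rests on an assumption the ticket does not confirm, namely that the scheduler knows which assets each worker host has already cached. The function name and the data layout are hypothetical.

```python
from typing import Dict, Set


def pick_host(job_assets: Set[str], cached_assets_by_host: Dict[str, Set[str]]) -> str:
    """Pick the host that would have to download the fewest assets for the job.

    Hypothetical helper; assumes per-host cache contents are known to the scheduler.
    """
    if not cached_assets_by_host:
        raise ValueError("no worker hosts available")
    # Rank hosts by the number of assets missing from their cache;
    # ties are broken by host name so the result is deterministic.
    return min(
        cached_assets_by_host,
        key=lambda host: (len(job_assets - cached_assets_by_host[host]), host),
    )
```

A real scheduler would also have to weigh host load and free slots; the sketch only covers the download-minimizing part of the idea.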
