action #69784

Updated by okurz over 3 years ago

## Observation 
 When a worker does not disconnect gracefully, it is supposed to be considered offline after `WORKERS_CHECKER_THRESHOLD` seconds. This mechanism does not work while there are still jobs which can be assigned to that worker: the scheduler may attempt to assign jobs to it and thereby updates the worker's `t_updated` column, which is also used to track the worker's last activity. So the worker keeps appearing online. This prevents the stale job detection from working for jobs whose worker never comes back.

 This can currently be observed with openqaworker7 on OSD (which has been an o3 worker for months): see #64514#note-6 
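
 The interaction described above can be illustrated with a minimal Python sketch. It is not openQA code: the `Worker` class, the function names and the threshold value are assumptions made for illustration only. The one thing taken from the ticket is that a single `t_updated` timestamp serves both as "row last modified" and as "worker last seen".

```python
from dataclasses import dataclass

# Illustrative value only; not the real openQA default.
WORKERS_CHECKER_THRESHOLD = 120  # seconds


@dataclass
class Worker:
    # One timestamp shared by worker status updates and scheduler writes.
    t_updated: float


def is_online(worker: Worker, now: float) -> bool:
    """The worker counts as online while t_updated is recent enough."""
    return now - worker.t_updated < WORKERS_CHECKER_THRESHOLD


def scheduler_tick(worker: Worker, now: float) -> None:
    """Hypothetical scheduler step: an assignment attempt touches the row."""
    worker.t_updated = now


def simulate(ticks: int, interval: float) -> bool:
    """The worker vanished at t=0; only the scheduler keeps touching its row."""
    worker = Worker(t_updated=0.0)
    now = 0.0
    for _ in range(ticks):
        now += interval
        scheduler_tick(worker, now)
    # Check one interval after the last scheduler write.
    return is_online(worker, now + interval)
```

 In this sketch `simulate(10, 60)` reports the worker as online although it has not sent anything since t=0, whereas the same worker without any scheduler writes would be reported offline.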

 ## Impact 

 * stale job detection is ineffective for the affected jobs 
 * users get misleading information about the affected workers 


 ## Notes 
 This was an oversight when solving #27454 and #57017 and increasing `WORKERS_CHECKER_THRESHOLD` to make the stale job detection less aggressive.   

 ## Suggestions 
 The scheduler should preserve the `t_updated` column, or we should keep track of the worker's activity using an additional column. I don't think it would be a good idea to move back to the previous approach of tracking the worker activity only within the web socket server.
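
 The second option (an additional column) could look roughly like the following Python sketch. The column name `t_seen`, the `Worker` class, the function names and the threshold value are all hypothetical and only serve to show the idea: the online check reads a timestamp that only the worker itself can refresh.

```python
from dataclasses import dataclass

# Illustrative value only; not the real openQA default.
WORKERS_CHECKER_THRESHOLD = 120  # seconds


@dataclass
class Worker:
    t_updated: float  # bumped by any write to the row
    t_seen: float     # hypothetical column: bumped only by the worker itself


def worker_status_update(worker: Worker, now: float) -> None:
    """The worker reported in, so both timestamps move forward."""
    worker.t_seen = now
    worker.t_updated = now


def scheduler_assign(worker: Worker, now: float) -> None:
    """The scheduler touches the row but must not affect t_seen."""
    worker.t_updated = now


def is_online(worker: Worker, now: float) -> bool:
    """The online check looks only at the worker's own activity."""
    return now - worker.t_seen < WORKERS_CHECKER_THRESHOLD
```

 With this split, a scheduler write at t=600 no longer makes a worker last seen at t=0 look online, while a genuine status update from the worker still does.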
