action #44162
closed
- Related to action #44105: if workercache dies, we get *tons* of incompletes added
- Subject changed from Various tests staid 'running' for ~ 4 hours to Various tests stayed 'running' for ~ 4 hours
- Category set to Regressions/Crashes
- Subject changed from Various tests stayed 'running' for ~ 4 hours to Various tests stayed 'running' for ~ 4 hours or longer
Let me hijack this ticket to reference the most recent examples on OSD, which have been running for > 6 days (!):
All of them have already been cloned to newer jobs, and they do not even block the workers anymore, as the assigned worker has since executed other jobs just fine. Cancelling the job over the web UI does not work, and restarting the worker instance's systemd service does not help either. Manually deleting the job does work, but I haven't done that for the above jobs, only for an older one last week.
- Status changed from New to Resolved
- Assignee set to okurz
Since then we have improved stall detection and worker handling, and refactored the worker-websockets-webui connection. I haven't observed this lately, so I think we actually have it covered.