action #44162
closed
Various tests stayed 'running' for ~ 4 hours or longer
0%
Description
Sample test:
https://openqa.opensuse.org/tests/801664 (raid10/TW) and https://openqa.opensuse.org/tests/801657 (cryptlvm/leap15.1)
It was a clone of an earlier job that finished incomplete.
The restarted job stayed 'running' for 4 hours but did not make any real progress. The test usually finishes much quicker: the normal runtime of RAID10/TW is 30 - 45 minutes.
Updated by okurz about 6 years ago
- Related to action #44105: if workercache dies, we get *tons* of incompletes added
Updated by okurz about 6 years ago
- Subject changed from Various tests staid 'running' for ~ 4 hours to Various tests stayed 'running' for ~ 4 hours
Updated by okurz over 5 years ago
- Subject changed from Various tests stayed 'running' for ~ 4 hours to Various tests stayed 'running' for ~ 4 hours or longer
Let me hijack this ticket to reference the most recent examples on OSD, which have been running for > 6 days (!):
- https://openqa.suse.de/tests/3218171
- https://openqa.suse.de/tests/3218169
- https://openqa.suse.de/tests/3218170
- https://openqa.suse.de/tests/3218172
All of them have already been cloned to newer jobs, and they do not even block the workers anymore, as the assigned worker has already executed other jobs just fine. Cancelling the job over the web UI does not work, and restarting the systemd job of the worker instance was not successful either. A manual deletion of the job does work, but I haven't done that on the above jobs, just on an older one last week.
Updated by okurz almost 5 years ago
- Status changed from New to Resolved
- Assignee set to okurz
Since then we have improved stall detection and worker handling, and refactored the worker-websockets-webui connection. We haven't observed this lately, so I think we actually have it covered.
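
For illustration only, the kind of check that would have flagged the jobs above can be sketched as follows. This is not the openQA implementation; the record layout (`id`, `state`, `started`), the 2 hour threshold and the sample jobs are all assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone


def find_stalled_jobs(jobs, now, max_runtime=timedelta(hours=2)):
    """Return the ids of jobs still in state 'running' for longer than max_runtime.

    `jobs` is a list of dicts with hypothetical keys 'id', 'state' and
    'started' (a timezone-aware datetime). The threshold is an assumption,
    chosen well above the 30 - 45 minutes a RAID10/TW run normally takes.
    """
    stalled = []
    for job in jobs:
        if job["state"] != "running":
            continue  # finished or cancelled jobs cannot be stalled
        if now - job["started"] > max_runtime:
            stalled.append(job["id"])
    return stalled


# Made-up sample data: an arbitrary reference time and three fictional jobs.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "state": "running", "started": now - timedelta(hours=4)},
    {"id": 2, "state": "running", "started": now - timedelta(minutes=40)},
    {"id": 3, "state": "done", "started": now - timedelta(hours=6)},
]
print(find_stalled_jobs(jobs, now))
```

Only the first job is reported: it is still 'running' after 4 hours, while the second is within the normal runtime and the third is no longer running.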