action #12178
closed
worker can hang when killing isotovideo
Start date:
2016-05-31
Due date:
% Done:
0%
Estimated time:
Description
observation
The worker does not finish the job on user_cancel, e.g. see http://lord.arch/tests/270: one module is reported as "passed" and the next one as "running", but the job itself is "user_cancelled".
output from worker log:
POST http://localhost/api/v1/jobs/270/status
stopping livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 10
...
starting livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 0.5
checking backend state ...
waitpid 3521 returned 0
updating status
POST http://localhost/api/v1/jobs/270/status
...
checking backend state ...
waitpid 3521 returned 0
...
POST http://localhost/api/v1/jobs/270/status
received command: cancel
stop_job cancel
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 3521
The worker then hangs. strace on the process reveals that one "isotovideo" subprocess is still in a loop, see attached strace dump.
steps to reproduce
TBC, possibly by triggering "user_cancel" many times in quick succession
problem
race condition on shutdown
suggestion
- check shutdown procedure for correctness
- KILL subprocesses after TERM + timeout
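The second suggestion could look roughly like the sketch below. It is written in Python only for illustration and is not the worker's actual code: the helper name stop_process and the term_timeout parameter are hypothetical, and the demo child is a stand-in for a backend process that does not react to TERM.

```python
import signal
import subprocess
import sys


def stop_process(proc: subprocess.Popen, term_timeout: float = 5.0) -> int:
    """Send SIGTERM, wait up to term_timeout seconds, then fall back to SIGKILL.

    Hypothetical helper; returns the exit status of the child.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited, nothing to do
    proc.terminate()  # polite request: SIGTERM
    try:
        return proc.wait(timeout=term_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()  # reap the child so no zombie is left behind


if __name__ == "__main__":
    # Stand-in for a stuck subprocess: ignores SIGTERM and loops forever.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "while True:\n"
        "    time.sleep(0.1)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code],
                             stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the child has installed its handler
    status = stop_process(child, term_timeout=1.0)
    print("child ended with status", status)
```

The bounded wait is the point: the caller never blocks longer than term_timeout on a child that ignores TERM. Whether the signal should also reach the child's own subprocesses (e.g. via a process group) is a separate question this sketch does not address.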
workaround
Send TERM to the worker, which should end the isotovideo process and not the worker itself. Otherwise, restart the worker.
Files