action #12178
Updated by okurz almost 9 years ago
## observation
The worker does not finish the job on user_cancel. For example, in http://lord.arch/tests/270 one module is reported as "passed" and the next one as "running", but the job itself is "user_cancelled".
Output from the worker log:
```
POST http://localhost/api/v1/jobs/270/status
stopping livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 10
...
starting livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 0.5
checking backend state ...
waitpid 3521 returned 0
updating status
POST http://localhost/api/v1/jobs/270/status
...
checking backend state ...
waitpid 3521 returned 0
...
POST http://localhost/api/v1/jobs/270/status
received command: cancelstop_job cancel
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 3521
```
The worker then hangs. Running strace on the process shows that one "isotovideo" subprocess is still in a loop; see the attached strace dump.
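The repeated `waitpid 3521 returned 0` lines presumably come from a non-blocking status check: with `WNOHANG`, `waitpid` returns 0 while the child is still running and returns the child's PID once it has exited. The minimal Python sketch below illustrates that check. `backend_alive` is a hypothetical helper, and `sleep` stands in for the isotovideo backend. This is not the worker's actual code.

```python
import os
import subprocess
import time


def backend_alive(pid: int) -> bool:
    """Non-blocking liveness check, like 'waitpid <pid> returned 0' in the log.

    os.waitpid with WNOHANG returns (0, 0) while the child is still running
    and (pid, status) once it has exited; the latter also reaps the child,
    so this must not be called again for the same pid after it returns False.
    """
    reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    return reaped_pid == 0


if __name__ == "__main__":
    # Hypothetical stand-in for the isotovideo backend: a long-running child.
    child = subprocess.Popen(["sleep", "30"])
    print("alive:", backend_alive(child.pid))
    child.terminate()  # SIGTERM on POSIX
    for _ in range(50):  # poll for up to about 5 seconds
        if not backend_alive(child.pid):
            print("backend exited")
            break
        time.sleep(0.1)
```

If the child ignores or never handles TERM, this check keeps returning "alive" indefinitely. That matches the hang described above.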
## steps to reproduce
TBC; possibly by triggering "user_cancel" many times in a row.
## problem
Race condition during shutdown.
## suggestion
* Check the shutdown procedure for correctness
* Send KILL to subprocesses that are still alive after TERM plus a timeout
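The suggested TERM-then-KILL escalation can be sketched in Python as follows. This only illustrates the pattern and is not the worker's code. The helper name `stop_with_escalation`, the 10-second default timeout, and the SIGTERM-ignoring child are hypothetical choices made for the demonstration.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then SIGKILL if still alive.

    Returns Popen.returncode, which is the negative signal number
    when the child was terminated by a signal.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited, nothing to do
    proc.terminate()  # SIGTERM on POSIX: the child may catch or ignore it
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        return proc.wait()


# Hypothetical stand-in for a hung subprocess: it ignores SIGTERM.
IGNORE_TERM = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", IGNORE_TERM],
                             stdout=subprocess.PIPE, text=True)
    child.stdout.readline()  # wait until the child has installed SIG_IGN
    status = stop_with_escalation(child, timeout=1.0)
    child.stdout.close()
    print(status)  # -9 on POSIX: SIGTERM was ignored, so SIGKILL was needed
```

Signals sent to a single PID do not reach that process's own children. If isotovideo leaves further subprocesses behind, signalling the whole process group (for example with `os.killpg`) would be one way to cover them too.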
## workaround
Send TERM to the worker. This should end the isotovideo process but not the worker itself. If that does not help, restart the worker.