action #12178

Updated by okurz almost 8 years ago

## observation 
 The worker does not finish a job on user_cancel. For example, in http://lord.arch/tests/270 one module is reported as "passed" and the next one as "running", although the job itself is "user_cancelled".
 Output from the worker log:

 
 ``` 
 POST http://localhost/api/v1/jobs/270/status 
 stopping livelog 
 ## changing timer update_status 
 ## removing timer update_status 
 ## adding timer update_status 10 
 ... 
 starting livelog 
 ## changing timer update_status 
 ## removing timer update_status 
 ## adding timer update_status 0.5 
 checking backend state ... 
 waitpid 3521 returned 0 
 updating status 
 POST http://localhost/api/v1/jobs/270/status 
 ... 
 checking backend state ... 
 waitpid 3521 returned 0 
 ... 
 POST http://localhost/api/v1/jobs/270/status 
 received command: cancelstop_job cancel 
 ## removing timer update_status 
 ## removing timer check_backend 
 ## removing timer job_timeout 
 killing 3521 
 ``` 
 The worker then hangs. Running strace on the process reveals that one "isotovideo" subprocess is still in a loop; see the attached strace dump.

 ## steps to reproduce 
 TBC; calling "user_cancel" very often might trigger it.

 ## problem 
 race condition on shutdown 

 ## suggestion 
 * check shutdown procedure for correctness 
 * send KILL to subprocesses that are still running after TERM plus a timeout
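
 The second suggestion could look roughly like the sketch below. It is written in Python purely for illustration and is not the worker's actual code; `stop_process_group`, `spawn_stubborn_child` and the timeout value are hypothetical names and numbers. It assumes the child was started in its own session, so that the signal reaches the whole process group and not just the direct child.

 ```python
import os
import signal
import subprocess
import sys


def stop_process_group(proc, term_timeout=5.0):
    """Send SIGTERM to the child's process group, then SIGKILL if it lingers.

    Returns "EXITED", "TERM" or "KILL" depending on what was needed.
    Only the group leader is waited for; other group members receive the
    signals but are not reaped here.
    """
    if proc.poll() is not None:
        return "EXITED"  # already gone, nothing to signal
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=term_timeout)
        return "TERM"
    except subprocess.TimeoutExpired:
        # The child ignored or blocked SIGTERM: escalate to SIGKILL.
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
        return "KILL"


def spawn_stubborn_child():
    """Hypothetical stand-in for a subprocess that ignores SIGTERM."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        start_new_session=True,  # own session and process group
    )
    proc.stdout.readline()  # wait until SIGTERM is actually being ignored
    return proc
 ```

 With a child that ignores TERM, `stop_process_group(spawn_stubborn_child(), term_timeout=0.5)` falls through to KILL instead of hanging, which is the behaviour the suggestion asks for.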
