action #12178

closed

worker can hang when killing isotovideo

Added by okurz almost 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Feature requests
Target version: -
Start date: 2016-05-31
Due date:
% Done: 0%
Estimated time:

Description

observation

The worker does not finish the job on user_cancel. For example, in http://lord.arch/tests/270 one module is reported as "passed" and the next one as "running", but the job itself is "user_cancelled".
Output from the worker log:

POST http://localhost/api/v1/jobs/270/status
stopping livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 10
...
starting livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 0.5
checking backend state ...
waitpid 3521 returned 0
updating status
POST http://localhost/api/v1/jobs/270/status
...
checking backend state ...
waitpid 3521 returned 0
...
POST http://localhost/api/v1/jobs/270/status
received command: cancelstop_job cancel
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 3521

The worker then hangs. strace on the process reveals that one "isotovideo" subprocess is still stuck in a loop; see the attached strace dump.
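The repeated "waitpid 3521 returned 0" lines are consistent with a non-blocking status check: with WNOHANG, waitpid() returns 0 while the child is still running and returns the child's PID once it has exited. The Python sketch below assumes the worker polls the backend this way; it is not the openQA worker's code, and the helper name check_backend_state is hypothetical, chosen only to echo the "checking backend state" log line.

```python
import os
import time


def check_backend_state(pid: int) -> bool:
    """Return True while the child process `pid` is still running.

    Hypothetical helper: assumes a non-blocking waitpid() poll with WNOHANG.
    waitpid() returns (0, 0) while the child has not changed state and
    (pid, status) once it has exited, reaping it in the process.
    """
    done_pid, _status = os.waitpid(pid, os.WNOHANG)
    print(f"waitpid {pid} returned {done_pid}")
    return done_pid == 0


if __name__ == "__main__":
    # Stand-in for the backend: a forked child that exits after one second.
    pid = os.fork()
    if pid == 0:
        time.sleep(1)
        os._exit(0)
    while check_backend_state(pid):
        time.sleep(0.5)  # poll interval chosen arbitrarily for this sketch
    print("backend exited")
```

By contrast, a blocking waitpid() (without WNOHANG) on a child that never exits would block its caller indefinitely, which is one way a process can hang after signalling a child that does not terminate.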

steps to reproduce

TBC; possibly by triggering "user_cancel" very often.

problem

race condition on shutdown

suggestion

  • check the shutdown procedure for correctness
  • send KILL to subprocesses that are still running when a timeout after TERM expires
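The second suggestion can be sketched as a TERM-then-KILL escalation. The following Python sketch is illustrative only and not taken from the openQA worker: it assumes the backend is started in its own session (start_new_session=True), so that signalling its process group also reaches any subprocesses isotovideo forked, such as the one still looping in the strace dump. The helper names group_alive and stop_backend and the 5-second default grace period are hypothetical.

```python
import os
import signal
import subprocess
import time


def group_alive(pgid: int) -> bool:
    """True if any process in group `pgid` still exists (signal 0 only probes)."""
    try:
        os.killpg(pgid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def stop_backend(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """TERM the backend's whole process group, then KILL whatever survives `grace` s.

    Assumes `proc` was started with start_new_session=True, so its PID is also
    its process group ID and killpg() reaches every subprocess it forked.
    Returns proc.returncode (negative N means "terminated by signal N").
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        proc.poll()  # reap the group leader if it has exited
        if not group_alive(pgid):
            return proc.wait()
        time.sleep(0.1)
    # Something ignored TERM (e.g. a subprocess stuck in a loop): force it.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return proc.wait()


if __name__ == "__main__":
    # Stand-in backend whose shell and child both ignore TERM, forcing the KILL path.
    backend = subprocess.Popen(
        ["sh", "-c", "trap '' TERM; sleep 60"], start_new_session=True
    )
    time.sleep(0.2)  # give the shell a moment to install the trap
    print("exit status:", stop_backend(backend, grace=1.0))
```

Signalling the whole group is the key design choice here: a kill() aimed only at the PID from "killing 3521" would not, by itself, reach subprocesses that PID had forked.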

workaround

Send TERM to the worker, which should end the isotovideo process but not the worker itself. If that does not work, restart the worker.


Files

isotovideo_hanging.out (22.1 KB), okurz, 2016-05-31 13:08
strace_hanging_worker.out (28 Bytes), okurz, 2016-05-31 13:08

Related issues: 3 (0 open, 3 closed)

Related to openQA Project - action #12566: The Worker Dies When the Job is Cancelled from GUI (Resolved, coolo, 2016-06-30)

Has duplicate openQA Project - action #12940: worker not going down (Resolved, coolo, 2016-07-30)

Has duplicate openQA Project - action #13482: isotovideo process fails to die on job completion, worker becomes stuck (Closed, 2016-08-27)
