Project

General

Profile

Actions

action #12178

closed

worker can hang when killing isotovideo

Added by okurz almost 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2016-05-31
Due date:
% Done:

0%

Estimated time:

Description

observation

worker does not finish job on user_cancel, e.g. see http://lord.arch/tests/270, one module reported as "passed", next one as "running" but the job itself is "user_cancelled".
output from worker log:

POST http://localhost/api/v1/jobs/270/status
stopping livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 10
...
starting livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 0.5
checking backend state ...
waitpid 3521 returned 0
updating status
POST http://localhost/api/v1/jobs/270/status
...
checking backend state ...
waitpid 3521 returned 0
...
POST http://localhost/api/v1/jobs/270/status
received command: cancelstop_job cancel
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 3521

then hangs. strace on process reveals that one "isotovideo" subprocess is still in a loop, see attached strace dump.

steps to reproduce

TBC, maybe call "user_cancel" very often

problem

race condition on shutdown

suggestion

  • check shutdown procedure for correctness
  • KILL subprocesses after TERM + timeout

workaround

send TERM to worker which should end the isotovideo process and not the worker itself. Otherwise, restart worker.


Files

isotovideo_hanging.out (22.1 KB) isotovideo_hanging.out okurz, 2016-05-31 13:08
strace_hanging_worker.out (28 Bytes) strace_hanging_worker.out okurz, 2016-05-31 13:08

Related issues 3 (0 open3 closed)

Related to openQA Project - action #12566: The Worker Dies When the Job is Cancelled from GUIResolvedcoolo2016-06-30

Actions
Has duplicate openQA Project - action #12940: worker not going downResolvedcoolo2016-07-30

Actions
Has duplicate openQA Project - action #13482: isotovideo process fails to die on job completion, worker becomes stuckClosed2016-08-27

Actions
Actions #2

Updated by okurz almost 8 years ago

  • Description updated (diff)
Actions #3

Updated by okurz almost 8 years ago

  • Status changed from New to In Progress

we assumed this would fixed with https://github.com/os-autoinst/os-autoinst/pull/519 but I just saw a problem with running isotovideo w/o a worker and it did not shutdown nicely by itself. Let's do some more testing.

Actions #4

Updated by coolo almost 8 years ago

I can still reproduce it ;(

stop_job quit
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 16885

results in

17:13:25.1817 signalhandler 16885: got TERM

results in

coolo    16885 16883  0 19:02 pts/4    00:00:03 /usr/bin/perl -w /home/coolo/prod/os-autoinst/isotovideo -d
coolo    16887 16885  0 19:02 pts/4    00:00:00 [isotovideo] <defunct>
coolo    16888 16885 24 19:02 pts/4    00:02:45 [isotovideo] <defunct>
Actions #5

Updated by okurz almost 8 years ago

yes, me too. Sometimes your commit messages are … let's say … optimistic "Fix the backend exiting" ;-)

Actions #7

Updated by okurz over 7 years ago

  • Assignee set to coolo

still happens at least when cancelling an s390x job

Actions #8

Updated by okurz over 7 years ago

  • Related to action #12566: The Worker Dies When the Job is Cancelled from GUI added
Actions #9

Updated by okurz over 7 years ago

It looks like
s390x zVM does not like shutdown, see http://lord.arch/tests/2271, ideas?

worker log excerpt

POST http://localhost/api/v1/jobs/2271/status
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
## removing timer job_timeout
max job time exceeded, aborting 00002271-sle-12-SP2-Server-DVD-s390x-Build2016-install_and_reboot@s390x-zVM-vswitch-l2 ...
stop_job timeout
## removing timer update_status
## removing timer check_backend
killing 5994
Actions #10

Updated by okurz over 7 years ago

Actions #11

Updated by okurz over 7 years ago

my proposal: https://github.com/os-autoinst/openQA/pull/802

So far I couldn't get to the underlying problem but I also encountered the same symptoms today on openqaw2:1

Aug 15 15:25:00 openqaw2 worker[14247]: 23088: WORKING 515670
Aug 16 00:22:22 openqaw2 worker[14247]: max job time exceeded, aborting 00515670-sle-12-SP2-Server-DVD-x...i ...
Aug 16 00:22:22 openqaw2 worker[14247]: killing 23088

stuck there for two days. I am pretty sure my change could help there, too.

Actions #12

Updated by okurz over 7 years ago

  • Priority changed from Normal to Urgent

job in same scenario got stuck again for multiple days. I consider this urgent but still my PR could help.

Actions #13

Updated by okurz over 7 years ago

  • Has duplicate action #13482: isotovideo process fails to die on job completion, worker becomes stuck added
Actions #14

Updated by okurz over 7 years ago

In #13482 AdamWill writes:

[…]
I'll give links to one example case, as they all seem to be identical. This is the job: https://openqa.stg.fedoraproject.org/tests/33900 . The worker service system log looks like this:

Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: got job 33900: 00033900-fedora-Rawhide-Server-dvd-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_updates_nfs@64bit
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: 3608: WORKING 33900
Aug 26 23:54:54 qa06.qa.fedoraproject.org worker[2508]: killing 3608

that's where they get stuck. At that point, 3608 - which is the isotovideo process - is still alive. In two of the cases I had to kill -9 it to make it go away (plain kill did not work), in one case, plain kill worked. Note that a 'normal' job closure looks like this:

Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: killing 3444
Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: waitpid returned error: No child processes
Aug 26 23:47:29 qa06.qa.fedoraproject.org worker[2508]: can't open /var/lib/openqa/pool/3/testresults/result-_software_selection.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 534.
Aug 26 23:47:31 qa06.qa.fedoraproject.org worker[2508]: cleaning up 00033872-fedora-Rawhide-Atomic-boot-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_default@64bit...
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: setting job 33872 to done

When I try to stop the service with systemd, I see this:

Aug 27 00:43:48 qa06.qa.fedoraproject.org systemd[1]: Stopping openQA Worker #3...
Aug 27 00:43:48 qa06.qa.fedoraproject.org worker[2508]: quit due to signal TERM
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: State 'stop-sigterm' timed out. Killing.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Main process exited, code=killed, status=9/KILL
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: Stopped openQA Worker #3.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Unit entered failed state.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Failed with result 'signal'.
Actions #15

Updated by coolo over 7 years ago

  • Status changed from In Progress to Resolved

I'm confident it's fixed now

Actions

Also available in: Atom PDF