action #12178 (closed)
worker can hang when killing isotovideo
Description
observation¶
worker does not finish job on user_cancel, e.g. see http://lord.arch/tests/270, one module reported as "passed", next one as "running" but the job itself is "user_cancelled".
output from worker log:
POST http://localhost/api/v1/jobs/270/status
stopping livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 10
...
starting livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 0.5
checking backend state ...
waitpid 3521 returned 0
updating status
POST http://localhost/api/v1/jobs/270/status
...
checking backend state ...
waitpid 3521 returned 0
...
POST http://localhost/api/v1/jobs/270/status
received command: cancel
stop_job cancel
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 3521
Then it hangs. strace on the process reveals that one "isotovideo" subprocess is still in a loop; see the attached strace dump.
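The repeated "checking backend state ... waitpid 3521 returned 0" lines in the log above come from a non-blocking child check. Here is a minimal Python sketch of that pattern (an illustration only; the actual worker is Perl and its function names differ): waitpid with WNOHANG returns 0 while the child is still running, and the child's pid once it has exited.

```python
import os
import time

def backend_alive(pid):
    """Non-blocking check: waitpid(WNOHANG) returns 0 while the child runs."""
    ret, _status = os.waitpid(pid, os.WNOHANG)
    return ret == 0  # 0 means the child has not exited yet

pid = os.fork()
if pid == 0:            # child process stands in for the isotovideo backend
    time.sleep(0.3)
    os._exit(0)

alive_while_running = backend_alive(pid)   # child still running: "returned 0"
time.sleep(0.6)                            # let the child exit
alive_after_exit = backend_alive(pid)      # child exited; this call reaps it
print(alive_while_running, alive_after_exit)
```

Note that the WNOHANG call that detects the exit also reaps the child, so a shutdown path that bypasses this polling loop must reap the child itself.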
steps to reproduce
TBC, maybe call "user_cancel" very often
problem
Race condition in the shutdown procedure.
suggestion
- check the shutdown procedure for correctness
- send KILL to subprocesses if they do not exit within a timeout after TERM
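The suggested TERM-then-KILL escalation can be sketched as follows. This is a hypothetical Python illustration of the idea, not the worker's actual (Perl) code; the helper name and timeouts are made up:

```python
import os
import signal
import time

def stop_child(pid, timeout=5.0, poll=0.1):
    """Send SIGTERM; if the child survives the timeout, SIGKILL it."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ret, _ = os.waitpid(pid, os.WNOHANG)
        if ret == pid:                 # child exited and was reaped
            return "terminated"
        time.sleep(poll)
    os.kill(pid, signal.SIGKILL)       # last resort for a hung child
    os.waitpid(pid, 0)                 # reap so no zombie is left behind
    return "killed"

pid = os.fork()
if pid == 0:                           # child ignores TERM, like a hung isotovideo
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(60)
    os._exit(0)

time.sleep(0.1)                        # let the child install its handler
result = stop_child(pid, timeout=0.5)
print(result)
```

The key point is the unconditional reap after SIGKILL: without it the worker would trade a hang for a zombie.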
workaround
Send TERM to the worker, which should end the isotovideo process but not the worker itself. Otherwise, restart the worker.
Files
Updated by okurz almost 9 years ago
- File isotovideo_hanging.out isotovideo_hanging.out added
- File strace_hanging_worker.out strace_hanging_worker.out added
- Description updated (diff)
Updated by okurz over 8 years ago
- Status changed from New to In Progress
We assumed this would be fixed with https://github.com/os-autoinst/os-autoinst/pull/519, but I just saw a problem when running isotovideo without a worker: it did not shut down cleanly by itself. Let's do some more testing.
Updated by coolo over 8 years ago
I can still reproduce it ;(
stop_job quit
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 16885
results in
17:13:25.1817 signalhandler 16885: got TERM
and leaves these processes behind:
coolo 16885 16883 0 19:02 pts/4 00:00:03 /usr/bin/perl -w /home/coolo/prod/os-autoinst/isotovideo -d
coolo 16887 16885 0 19:02 pts/4 00:00:00 [isotovideo] <defunct>
coolo 16888 16885 24 19:02 pts/4 00:02:45 [isotovideo] <defunct>
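The two [isotovideo] &lt;defunct&gt; entries above are zombies: children that have already exited but were never reaped by their parent (pid 16885) with waitpid. A minimal demonstration of that mechanism, as a hedged Python sketch (Linux-only, since it reads /proc; not os-autoinst code):

```python
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits immediately

time.sleep(0.2)          # child has exited but is not yet reaped: a zombie
with open(f"/proc/{pid}/stat") as f:
    # /proc/<pid>/stat format: "pid (comm) state ..."; 'Z' means zombie
    state_before = f.read().split(")")[1].split()[0]

os.waitpid(pid, 0)       # reaping removes the <defunct> entry
reaped = not os.path.exists(f"/proc/{pid}")

print(state_before, reaped)
```

So a &lt;defunct&gt; child is already dead; the fix is for the parent to wait on it (or exit so init adopts and reaps it), not to send it more signals.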
Updated by okurz over 8 years ago
Yes, me too. Sometimes your commit messages are … let's say … optimistic: "Fix the backend exiting" ;-)
Updated by coolo over 8 years ago
https://github.com/os-autoinst/os-autoinst/pull/523 works in my case
Updated by okurz over 8 years ago
- Assignee set to coolo
Still happens, at least when cancelling an s390x job.
Updated by okurz over 8 years ago
- Related to action #12566: The Worker Dies When the Job is Cancelled from GUI added
Updated by okurz over 8 years ago
It looks like s390x zVM does not like the shutdown; see http://lord.arch/tests/2271. Ideas?
Worker log excerpt:
POST http://localhost/api/v1/jobs/2271/status
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
## removing timer job_timeout
max job time exceeded, aborting 00002271-sle-12-SP2-Server-DVD-s390x-Build2016-install_and_reboot@s390x-zVM-vswitch-l2 ...
stop_job timeout
## removing timer update_status
## removing timer check_backend
killing 5994
Updated by okurz over 8 years ago
- Has duplicate action #12940: worker not going down added
Updated by okurz over 8 years ago
my proposal: https://github.com/os-autoinst/openQA/pull/802
So far I couldn't get to the underlying problem, but I also encountered the same symptoms today on openqaw2:1:
Aug 15 15:25:00 openqaw2 worker[14247]: 23088: WORKING 515670
Aug 16 00:22:22 openqaw2 worker[14247]: max job time exceeded, aborting 00515670-sle-12-SP2-Server-DVD-x...i ...
Aug 16 00:22:22 openqaw2 worker[14247]: killing 23088
It was stuck there for two days. I am pretty sure my change would help there, too.
Updated by okurz over 8 years ago
- Priority changed from Normal to Urgent
A job in the same scenario got stuck again for multiple days. I consider this urgent, but my PR could still help.
Updated by okurz over 8 years ago
- Has duplicate action #13482: isotovideo process fails to die on job completion, worker becomes stuck added
Updated by okurz over 8 years ago
In #13482 AdamWill writes:
[…]
I'll give links to one example case, as they all seem to be identical. This is the job: https://openqa.stg.fedoraproject.org/tests/33900 . The worker service system log looks like this:
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: got job 33900: 00033900-fedora-Rawhide-Server-dvd-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_updates_nfs@64bit
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: 3608: WORKING 33900
Aug 26 23:54:54 qa06.qa.fedoraproject.org worker[2508]: killing 3608
that's where they get stuck. At that point, 3608 - which is the isotovideo process - is still alive. In two of the cases I had to kill -9 it to make it go away (plain kill did not work); in one case, plain kill worked. Note that a 'normal' job closure looks like this:
Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: killing 3444
Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: waitpid returned error: No child processes
Aug 26 23:47:29 qa06.qa.fedoraproject.org worker[2508]: can't open /var/lib/openqa/pool/3/testresults/result-_software_selection.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 534.
Aug 26 23:47:31 qa06.qa.fedoraproject.org worker[2508]: cleaning up 00033872-fedora-Rawhide-Atomic-boot-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_default@64bit...
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: setting job 33872 to done
When I try to stop the service with systemd, I see this:
Aug 27 00:43:48 qa06.qa.fedoraproject.org systemd[1]: Stopping openQA Worker #3...
Aug 27 00:43:48 qa06.qa.fedoraproject.org worker[2508]: quit due to signal TERM
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: State 'stop-sigterm' timed out. Killing.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Main process exited, code=killed, status=9/KILL
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: Stopped openQA Worker #3.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Unit entered failed state.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Failed with result 'signal'.
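Worth noting: the systemd log above shows systemd itself applying exactly the escalation suggested in the description - SIGTERM on stop, then, after 'stop-sigterm' times out (the 90-second gap matches the default TimeoutStopSec), SIGKILL to the whole unit. As a stopgap until the worker handles this itself, the wait could be shortened with a drop-in; the values below are hypothetical examples, not recommended settings:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/openqa-worker@.service.d/timeout.conf
[Service]
# Fall back to SIGKILL sooner if the worker (and its isotovideo children)
# ignore SIGTERM on stop; the log above shows the 90 s default being hit.
TimeoutStopSec=30
# 'control-group' (the default) signals every process in the unit's cgroup,
# which also catches hung isotovideo children the worker failed to kill.
KillMode=control-group
```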
Updated by coolo over 8 years ago
- Status changed from In Progress to Resolved
I'm confident it's fixed now.