action #12178 (closed)
worker can hang when killing isotovideo
Description
observation¶
worker does not finish job on user_cancel, e.g. see http://lord.arch/tests/270, one module reported as "passed", next one as "running" but the job itself is "user_cancelled".
output from worker log:
POST http://localhost/api/v1/jobs/270/status
stopping livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 10
...
starting livelog
## changing timer update_status
## removing timer update_status
## adding timer update_status 0.5
checking backend state ...
waitpid 3521 returned 0
updating status
POST http://localhost/api/v1/jobs/270/status
...
checking backend state ...
waitpid 3521 returned 0
...
POST http://localhost/api/v1/jobs/270/status
received command: cancel
stop_job cancel
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 3521
Then it hangs. strace on the process reveals that one "isotovideo" subprocess is still in a loop; see the attached strace dump.
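The repeated "checking backend state ... waitpid 3521 returned 0" lines in the log above come from a non-blocking child check. Here is a minimal Python sketch of that pattern (an illustration only; the actual worker is Perl and its function names differ): waitpid with WNOHANG returns 0 while the child is still running, and the child's pid once it has exited.

```python
import os
import time

def backend_alive(pid):
    """Non-blocking check: waitpid(WNOHANG) returns 0 while the child runs."""
    ret, _status = os.waitpid(pid, os.WNOHANG)
    return ret == 0  # 0 means the child has not exited yet

pid = os.fork()
if pid == 0:            # child process stands in for the isotovideo backend
    time.sleep(0.3)
    os._exit(0)

alive_while_running = backend_alive(pid)   # child still running: "returned 0"
time.sleep(0.6)                            # let the child exit
alive_after_exit = backend_alive(pid)      # child exited; this call reaps it
print(alive_while_running, alive_after_exit)
```

Note that the WNOHANG call that detects the exit also reaps the child, so a shutdown path that bypasses this polling loop must reap the child itself.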
steps to reproduce
TBC, maybe call "user_cancel" very often
problem
Race condition in the shutdown procedure.
suggestion
- check the shutdown procedure for correctness
- send KILL to subprocesses if they do not exit within a timeout after TERM
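The suggested TERM-then-KILL escalation can be sketched as follows. This is a hypothetical Python illustration of the idea, not the worker's actual (Perl) code; the helper name and timeouts are made up:

```python
import os
import signal
import time

def stop_child(pid, timeout=5.0, poll=0.1):
    """Send SIGTERM; if the child survives the timeout, SIGKILL it."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ret, _ = os.waitpid(pid, os.WNOHANG)
        if ret == pid:                 # child exited and was reaped
            return "terminated"
        time.sleep(poll)
    os.kill(pid, signal.SIGKILL)       # last resort for a hung child
    os.waitpid(pid, 0)                 # reap so no zombie is left behind
    return "killed"

pid = os.fork()
if pid == 0:                           # child ignores TERM, like a hung isotovideo
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(60)
    os._exit(0)

time.sleep(0.1)                        # let the child install its handler
result = stop_child(pid, timeout=0.5)
print(result)
```

The key point is the unconditional reap after SIGKILL: without it the worker would trade a hang for a zombie.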
workaround
Send TERM to the worker, which should end the isotovideo process but not the worker itself. Otherwise, restart the worker.
Files
Updated by okurz almost 9 years ago
- File isotovideo_hanging.out isotovideo_hanging.out added
- File strace_hanging_worker.out strace_hanging_worker.out added
- Description updated (diff)
Updated by okurz over 8 years ago
- Status changed from New to In Progress
We assumed this would be fixed with https://github.com/os-autoinst/os-autoinst/pull/519, but I just saw a problem when running isotovideo without a worker: it did not shut down cleanly by itself. Let's do some more testing.
Updated by coolo over 8 years ago
I can still reproduce it ;(
stop_job quit
## removing timer update_status
## removing timer check_backend
## removing timer job_timeout
killing 16885
results in
17:13:25.1817 signalhandler 16885: got TERM
and leaves these processes behind:
coolo 16885 16883 0 19:02 pts/4 00:00:03 /usr/bin/perl -w /home/coolo/prod/os-autoinst/isotovideo -d
coolo 16887 16885 0 19:02 pts/4 00:00:00 [isotovideo] <defunct>
coolo 16888 16885 24 19:02 pts/4 00:02:45 [isotovideo] <defunct>
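The two [isotovideo] &lt;defunct&gt; entries above are zombies: children that have already exited but were never reaped by their parent (pid 16885) with waitpid. A minimal demonstration of that mechanism, as a hedged Python sketch (Linux-only, since it reads /proc; not os-autoinst code):

```python
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits immediately

time.sleep(0.2)          # child has exited but is not yet reaped: a zombie
with open(f"/proc/{pid}/stat") as f:
    # /proc/<pid>/stat format: "pid (comm) state ..."; 'Z' means zombie
    state_before = f.read().split(")")[1].split()[0]

os.waitpid(pid, 0)       # reaping removes the <defunct> entry
reaped = not os.path.exists(f"/proc/{pid}")

print(state_before, reaped)
```

So a &lt;defunct&gt; child is already dead; the fix is for the parent to wait on it (or exit so init adopts and reaps it), not to send it more signals.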
Updated by okurz over 8 years ago
Yes, me too. Sometimes your commit messages are … let's say … optimistic: "Fix the backend exiting" ;-)
Updated by coolo over 8 years ago
https://github.com/os-autoinst/os-autoinst/pull/523 works in my case
Updated by okurz over 8 years ago
- Assignee set to coolo
Still happens, at least when cancelling an s390x job.
Updated by okurz over 8 years ago
- Related to action #12566: The Worker Dies When the Job is Cancelled from GUI added
Updated by okurz over 8 years ago
It looks like s390x zVM does not like the shutdown; see http://lord.arch/tests/2271. Ideas?
Worker log excerpt:
POST http://localhost/api/v1/jobs/2271/status
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
checking backend state ...
waitpid 5994 returned 0
## removing timer job_timeout
max job time exceeded, aborting 00002271-sle-12-SP2-Server-DVD-s390x-Build2016-install_and_reboot@s390x-zVM-vswitch-l2 ...
stop_job timeout
## removing timer update_status
## removing timer check_backend
killing 5994
Updated by okurz over 8 years ago
- Has duplicate action #12940: worker not going down added
Updated by okurz over 8 years ago
my proposal: https://github.com/os-autoinst/openQA/pull/802
So far I couldn't get to the underlying problem, but I also encountered the same symptoms today on openqaw2:1:
Aug 15 15:25:00 openqaw2 worker[14247]: 23088: WORKING 515670
Aug 16 00:22:22 openqaw2 worker[14247]: max job time exceeded, aborting 00515670-sle-12-SP2-Server-DVD-x...i ...
Aug 16 00:22:22 openqaw2 worker[14247]: killing 23088
It was stuck there for two days. I am pretty sure my change would help there, too.
Updated by okurz over 8 years ago
- Priority changed from Normal to Urgent
A job in the same scenario got stuck again for multiple days. I consider this urgent, but my PR could still help.
Updated by okurz over 8 years ago
- Has duplicate action #13482: isotovideo process fails to die on job completion, worker becomes stuck added
Updated by okurz over 8 years ago
In #13482 AdamWill writes:
[…]
I'll give links to one example case, as they all seem to be identical. This is the job: https://openqa.stg.fedoraproject.org/tests/33900 . The worker service system log looks like this:
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: got job 33900: 00033900-fedora-Rawhide-Server-dvd-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_updates_nfs@64bit
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: 3608: WORKING 33900
Aug 26 23:54:54 qa06.qa.fedoraproject.org worker[2508]: killing 3608
that's where they get stuck. At that point, 3608 - which is the isotovideo process - is still alive. In two of the cases I had to kill -9 it to make it go away (plain kill did not work); in one case, plain kill worked. Note that a 'normal' job closure looks like this:
Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: killing 3444
Aug 26 23:47:28 qa06.qa.fedoraproject.org worker[2508]: waitpid returned error: No child processes
Aug 26 23:47:29 qa06.qa.fedoraproject.org worker[2508]: can't open /var/lib/openqa/pool/3/testresults/result-_software_selection.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 534.
Aug 26 23:47:31 qa06.qa.fedoraproject.org worker[2508]: cleaning up 00033872-fedora-Rawhide-Atomic-boot-iso-x86_64-BuildFedora-Rawhide-20160826.n.1-install_default@64bit...
Aug 26 23:47:32 qa06.qa.fedoraproject.org worker[2508]: setting job 33872 to done
When I try to stop the service with systemd, I see this:
Aug 27 00:43:48 qa06.qa.fedoraproject.org systemd[1]: Stopping openQA Worker #3...
Aug 27 00:43:48 qa06.qa.fedoraproject.org worker[2508]: quit due to signal TERM
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: State 'stop-sigterm' timed out. Killing.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Main process exited, code=killed, status=9/KILL
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: Stopped openQA Worker #3.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Unit entered failed state.
Aug 27 00:45:18 qa06.qa.fedoraproject.org systemd[1]: openqa-worker@3.service: Failed with result 'signal'.
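Worth noting: the systemd log above shows systemd itself applying exactly the escalation suggested in the description - SIGTERM on stop, then, after 'stop-sigterm' times out (the 90-second gap matches the default TimeoutStopSec), SIGKILL to the whole unit. As a stopgap until the worker handles this itself, the wait could be shortened with a drop-in; the values below are hypothetical examples, not recommended settings:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/openqa-worker@.service.d/timeout.conf
[Service]
# Fall back to SIGKILL sooner if the worker (and its isotovideo children)
# ignore SIGTERM on stop; the log above shows the 90 s default being hit.
TimeoutStopSec=30
# 'control-group' (the default) signals every process in the unit's cgroup,
# which also catches hung isotovideo children the worker failed to kill.
KillMode=control-group
```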
Updated by coolo over 8 years ago
- Status changed from In Progress to Resolved
I'm confident it's fixed now.