action #17264
closed
Workers are killed when test run encounters syntax error
Description
I think ever since this commit, by coolo:
https://github.com/os-autoinst/openQA/commit/7c672e6e2b0cb4bc07892f6db2db330e5f98767b
Any time a job fails because some idiot monkey (ahem) screwed up his perl syntax again - like this one:
https://openqa.stg.fedoraproject.org/tests/72492
it seems like the job is duplicated, and then the worker process it ran on quits (it doesn't crash or die, it exits 0). This means that whenever I trigger such a job, a bunch of my workers exit and I have to go ssh into the worker host to restart the service.
Was this an intentional consequence of the change, or an oversight? I can't quite tell. This is what the logs look like:
Feb 22 22:33:36 qa09.qa.fedoraproject.org worker[41274]: [INFO] 5566: WORKING 72492
Feb 22 22:33:38 qa09.qa.fedoraproject.org worker[41274]: child 5566 died with exit status 256
Feb 22 22:33:42 qa09.qa.fedoraproject.org worker[41274]: can't open /var/lib/openqa/pool/1/testresults/test_order.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 735.
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] duplicating job 72492
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: no tests scheduled)
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [INFO] cleaning up 00072492-fedora-25-updates-workstation-x86_64-BuildFEDORA-2017-87896dfb59-base_selinux@64bit
That's it; at that point it dies. I'm pretty sure it's because we hit the call to _stop_job_finish with the $quit arg set to 1, but I'm not sure why that's happening.
Note that https://github.com/os-autoinst/openQA/commit/819b41c0aa9db3dc4a00d7e1e1d74f0193f23739 changed the code a bit after coolo's commit, but I don't think it changes the logic in this case (i.e. it's coolo's commit that started this happening).
Updated by szarate almost 8 years ago
Looks like a variant of a setup failure (sort of): https://github.com/os-autoinst/openQA/pull/1199
Updated by AdamWill almost 8 years ago
- Status changed from New to Rejected
Hah! This is actually my fault, and not valid in upstream. I'm patching Fedora's openQA to change the condition for the 'duplicating job' block from:
if ($aborted eq 'quit')
to:
if ($aborted eq 'quit' || $aborted eq 'died')
because we want to auto-dupe jobs that die in Fedora (mainly now because of this annoying bug - https://bugzilla.redhat.com/show_bug.cgi?id=1403343 - which causes our qemu processes to just suddenly crash sometimes). With the original version of coolo's change - where the value of the $quit arg was set by a $aborted eq 'quit' check - this was still OK, but with the current version of the code - where the value of $quit is just hardcoded to 1 inside this block - it causes my problem, because $quit will get set to 1 for both 'died' and 'quit'.
I'll just adjust our downstream patch so it only sets the $quit value to 1 if $aborted was 'quit'...
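Roughly, the adjusted logic would look like the sketch below. This is only an illustration of the intended behaviour, not the actual patch: plan_stop is a hypothetical helper that stands in for the 'duplicating job' block and the call to _stop_job_finish, and the hash keys are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of openQA): given the $aborted reason,
# decide whether the job gets duplicated and whether the worker quits.
sub plan_stop {
    my ($aborted) = @_;

    # Downstream Fedora behaviour: auto-dupe jobs that quit OR died.
    my $duplicate = ($aborted eq 'quit' || $aborted eq 'died') ? 1 : 0;

    # Only a real 'quit' should make the worker process exit;
    # hardcoding this to 1 is what took the workers down on 'died'.
    my $quit = ($aborted eq 'quit') ? 1 : 0;

    return {duplicate => $duplicate, quit => $quit};
}

# Show the decision for a few abort reasons.
for my $reason (qw(quit died done)) {
    my $plan = plan_stop($reason);
    printf "%-4s duplicate=%d quit=%d\n", $reason, $plan->{duplicate}, $plan->{quit};
}
```

With that split, a job that dies on a syntax error still gets duplicated, but the worker keeps running instead of exiting.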