action #17264
closedWorkers are killed when test run encounters syntax error
0%
Description
I think ever since this commit, by coolo:
https://github.com/os-autoinst/openQA/commit/7c672e6e2b0cb4bc07892f6db2db330e5f98767b
Any time a job fails because some idiot monkey (ahem) screwed up his perl syntax again - like this one:
https://openqa.stg.fedoraproject.org/tests/72492
it seems like the job is duplicated, and then the worker process it ran on quits (it doesn't crash or die, it exits 0). This means that whenever I trigger such a job, a bunch of my workers exit and I have to go ssh into the worker host to restart the service.
Was this an intentional consequence of the change, or an oversight? I can't quite tell. This is what the logs look like:
Feb 22 22:33:36 qa09.qa.fedoraproject.org worker[41274]: [INFO] 5566: WORKING 72492
Feb 22 22:33:38 qa09.qa.fedoraproject.org worker[41274]: child 5566 died with exit status 256
Feb 22 22:33:42 qa09.qa.fedoraproject.org worker[41274]: can't open /var/lib/openqa/pool/1/testresults/test_order.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 735.
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] duplicating job 72492
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: no tests scheduled)
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [INFO] cleaning up 00072492-fedora-25-updates-workstation-x86_64-BuildFEDORA-2017-87896dfb59-base_selinux@64bit
that's it, at that point it dies. I'm pretty sure it's because we hit the call to _stop_job_finish
with the $quit
arg set to 1, but I'm not sure why that's happening.
Note that https://github.com/os-autoinst/openQA/commit/819b41c0aa9db3dc4a00d7e1e1d74f0193f23739 changed the code a bit after coolo's commit, but I don't think it changes the logic in this case (i.e. it's coolo's commit that started this happening).