action #17264

closed

Workers are killed when test run encounters syntax error

Added by AdamWill about 7 years ago. Updated about 7 years ago.

Status: Rejected
Priority: High
Assignee: -
Category: Feature requests
Target version: -
Start date: 2017-02-22
Due date:
% Done: 0%
Estimated time:

Description

I think that ever since this commit by coolo:

https://github.com/os-autoinst/openQA/commit/7c672e6e2b0cb4bc07892f6db2db330e5f98767b

any time a job fails because some idiot monkey (ahem) screwed up his Perl syntax again, like this one:

https://openqa.stg.fedoraproject.org/tests/72492

the job seems to be duplicated, and then the worker process it ran on quits (it doesn't crash or die, it exits 0). This means that whenever I trigger such a job, a bunch of my workers exit and I have to ssh into the worker host to restart the service.

Was this an intentional consequence of the change, or an oversight? I can't quite tell. This is what the logs look like:

Feb 22 22:33:36 qa09.qa.fedoraproject.org worker[41274]: [INFO] 5566: WORKING 72492
Feb 22 22:33:38 qa09.qa.fedoraproject.org worker[41274]: child 5566 died with exit status 256
Feb 22 22:33:42 qa09.qa.fedoraproject.org worker[41274]: can't open /var/lib/openqa/pool/1/testresults/test_order.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 735.
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] duplicating job 72492
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: no tests scheduled)
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [INFO] cleaning up 00072492-fedora-25-updates-workstation-x86_64-BuildFEDORA-2017-87896dfb59-base_selinux@64bit

That's it; at that point the worker process exits. I'm pretty sure it's because we hit the call to _stop_job_finish with the $quit arg set to 1, but I'm not sure why that's happening.

Note that https://github.com/os-autoinst/openQA/commit/819b41c0aa9db3dc4a00d7e1e1d74f0193f23739 changed the code a bit after coolo's commit, but I don't think it changes the logic in this case (i.e. it's coolo's commit that started this happening).
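
A side note on the "exit status 256" line in the log: 256 is not a valid process exit code (those run from 0 to 255), so this is presumably the raw wait status, as Perl's $? reports it, which would mean the child actually exited with code 1. A minimal sketch of the decoding, written in Python purely for illustration (decode_wait_status is a hypothetical helper, not part of openQA, and it assumes the usual 16-bit POSIX-style layout):

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw 16-bit wait status into its conventional parts.

    Assumes the common layout: high byte = exit code,
    low 7 bits = terminating signal, bit 7 = core dump flag.
    """
    return {
        "exit_code": (status >> 8) & 0xFF,
        "signal": status & 0x7F,
        "core_dumped": bool(status & 0x80),
    }


# The value from the worker log above.
print(decode_wait_status(256))
# prints {'exit_code': 1, 'signal': 0, 'core_dumped': False}
```

Read that way, the child (5566) exited with code 1 and was not killed by a signal, which fits the test run failing on a syntax error, while the parent worker (41274) is the process that later exits 0.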
