action #17264

closed

Workers are killed when test run encounters syntax error

Added by AdamWill about 7 years ago. Updated about 7 years ago.

Status: Rejected
Priority: High
Assignee: -
Category: Feature requests
Target version: -
Start date: 2017-02-22
Due date:
% Done: 0%
Estimated time:
Description

I think ever since this commit, by coolo:

https://github.com/os-autoinst/openQA/commit/7c672e6e2b0cb4bc07892f6db2db330e5f98767b

any time a job fails because some idiot monkey (ahem) screwed up his Perl syntax again, like this one:

https://openqa.stg.fedoraproject.org/tests/72492

it seems like the job is duplicated, and then the worker process it ran on quits (it doesn't crash or die, it exits 0). This means that whenever I trigger such a job, a bunch of my workers exit and I have to go ssh into the worker host to restart the service.

Was this an intentional consequence of the change, or an oversight? I can't quite tell. This is what the logs look like:

Feb 22 22:33:36 qa09.qa.fedoraproject.org worker[41274]: [INFO] 5566: WORKING 72492
Feb 22 22:33:38 qa09.qa.fedoraproject.org worker[41274]: child 5566 died with exit status 256
Feb 22 22:33:42 qa09.qa.fedoraproject.org worker[41274]: can't open /var/lib/openqa/pool/1/testresults/test_order.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 735.
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] duplicating job 72492
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [DEBUG] Either there is no job running or we were asked to stop: (1|Reason: no tests scheduled)
Feb 22 22:33:44 qa09.qa.fedoraproject.org worker[41274]: [INFO] cleaning up 00072492-fedora-25-updates-workstation-x86_64-BuildFEDORA-2017-87896dfb59-base_selinux@64bit

That's it; at that point the worker exits. I'm pretty sure it's because we hit the call to _stop_job_finish with the $quit arg set to 1, but I'm not sure why that's happening.

Note that https://github.com/os-autoinst/openQA/commit/819b41c0aa9db3dc4a00d7e1e1d74f0193f23739 changed the code a bit after coolo's commit, but I don't think it changes the logic in this case (i.e. it's coolo's commit that started this happening).

Actions #1

Updated by szarate about 7 years ago

Looks like a variant of a setup failure (sort of): https://github.com/os-autoinst/openQA/pull/1199

Actions #2

Updated by AdamWill about 7 years ago

  • Status changed from New to Rejected

Hah! This is actually my fault, and not valid in upstream. I'm patching Fedora's openQA to change the condition for the 'duplicating job' block from:

if ($aborted eq 'quit')

to:

if ($aborted eq 'quit' || $aborted eq 'died')

because we want to auto-dupe jobs that die in Fedora (mainly now because of this annoying bug - https://bugzilla.redhat.com/show_bug.cgi?id=1403343 - which causes our qemu processes to just suddenly crash sometimes). With the original version of coolo's change, where the value of the $quit arg was set by a $aborted eq 'quit' check, this was still OK. With the current version of the code, where the value of $quit is just hardcoded to 1 inside this block, it causes my problem, because $quit gets set to 1 for both 'died' and 'quit'.

I'll just adjust our downstream patch so it only sets the $quit value to 1 if $aborted was 'quit'...
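
For illustration, the adjusted logic could look roughly like the sketch below. This is not the actual openQA or Fedora patch code: stop_decision is a hypothetical standalone helper, and the 'other' reason is only a placeholder for any value that is neither 'quit' nor 'died'. It just shows the duplicate decision and the $quit decision being made separately.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of openQA: models what the worker should
# do for a given $aborted value under the Fedora downstream policy.
sub stop_decision {
    my ($aborted) = @_;

    # Downstream condition: duplicate the job on both 'quit' and 'died'.
    my $duplicate = ($aborted eq 'quit' || $aborted eq 'died') ? 1 : 0;

    # Only a genuine 'quit' should make the worker process exit afterwards;
    # a job that merely died is duplicated, but the worker keeps running.
    my $quit = ($duplicate && $aborted eq 'quit') ? 1 : 0;

    return { duplicate => $duplicate, quit => $quit };
}

# Print the decision for each reason ('other' is a placeholder value).
for my $reason (qw(quit died other)) {
    my $d = stop_decision($reason);
    printf "%-5s duplicate=%d quit=%d\n", $reason, $d->{duplicate}, $d->{quit};
}
```

Keeping the two decisions apart means the 'died' case still gets auto-duped without taking the worker down with it.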
