action #13042

closed

Tests that run over MAX_JOB_TIME should fail

Added by coolo over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 2016-08-06
Due date:
% Done: 0%
Estimated time:

Description

Right now these jobs end up incomplete and are restarted forever. This makes it very hard to spot the problem.

Actions #1

Updated by okurz over 8 years ago

I guess the problem is again the question of who is responsible for INCOMPLETE jobs and who is responsible for FAILED jobs. For me it always comes back to the question of whether we should really auto-restart incomplete jobs. At least as long as we don't have a good view of incomplete jobs that an admin can work through, I don't think it is good to have this auto-restart. It is also very annoying for local test development, where the cause is obvious (the test developer did something wrong). I don't agree in general that tests that run over the time limit should fail. I can only see two reasons for hitting MAX_JOB_TIME:

  • When the lower-level timeouts of all testapi calls sum up to more than MAX_JOB_TIME --> the test developer should either lower the testapi call timeouts or adjust MAX_JOB_TIME
  • Something is hanging in test execution --> the test developer did something really weird or the backend has a bug

Both cases mean that the product is not to blame, therefore the test should not complete with FAILED.
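
To illustrate the first case, here is a minimal sketch of checking whether the worst-case waits of a test can add up to more than the job limit. The function name, the timeout values and the limit passed in are assumptions made for illustration; nothing here is taken from openQA's code.

```python
def exceeds_job_limit(call_timeouts, max_job_time):
    """Return True if the summed worst-case waits exceed the job limit.

    call_timeouts: per-call timeouts in seconds (hypothetical values).
    max_job_time:  the job time limit in seconds, e.g. the MAX_JOB_TIME setting.
    """
    return sum(call_timeouts) > max_job_time


# Hypothetical timeouts of three testapi calls, in seconds.
timeouts = [1800, 3600, 2400]

if exceeds_job_limit(timeouts, max_job_time=7200):
    print("Worst case exceeds the limit: lower call timeouts or raise MAX_JOB_TIME")
```

A check like this only covers the first case; a hang in test execution or in the backend would not show up in the sum.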

Can we just make INCOMPLETE jobs not auto-start again? If you want some high-level auto-retrigger magic, we should do this only based on known patterns, as the Jenkins plugin for known failure causes does.

Actions #2

Updated by coolo over 8 years ago

We can try. I can't really judge the pros of restarting, but right now I know several cons ;(

Actions #3

Updated by coolo over 8 years ago

If the webapi failed the worker, though, the worker is right to duplicate the job.

Actions #4

Updated by okurz over 8 years ago

Yes, that's certainly true. It means we can still improve our monitoring of error messages in the web server log files, but the worker should retry if the failure is not its fault.

Actions #5

Updated by okurz over 8 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Actions #6

Updated by okurz over 8 years ago

  • Status changed from In Progress to Resolved

PR merged and deployed

Actions #7

Updated by AdamWill over 8 years ago

For the record, I'm gonna revert this for Fedora, at least for now. I'd really like our tests to be comparable day-to-day as best as possible, and we still get some just 'random crap' incomplete jobs... just today I have one where a video upload failed for no readily discernible reason:

Sep 11 02:48:00 qa06.qa.fedoraproject.org worker[31095]: ERROR video.ogv: Connection error: Premature connection close

and we still get a few tests like this one:

https://openqa.stg.fedoraproject.org/tests/41821/file/autoinst-log.txt

where for some reason the worker process just fails to connect to the qemu VNC server. And I've had a couple of cases where asset upload hit a 403 for some bizarre reason. I don't have time to try and figure out all these bizarre things for now; auto-duplication does a decent job of making them not really a problem.

Actions #8

Updated by AdamWill over 8 years ago

For a compromise, it should probably be pretty easy to only restart 'mysteriously incomplete' jobs once (or X number of times), instead of doing it infinitely if they keep hitting that state.
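
A rough sketch of what such a cap could look like, assuming each job carries a count of how often it has already been restarted. The function, its parameters and the result strings are hypothetical and do not reflect how the worker is actually implemented.

```python
def should_restart(result, restart_count, max_restarts=1):
    """Decide whether an incomplete job gets another automatic restart.

    result:        final job result as a string (assumed values, e.g. "incomplete").
    restart_count: how many times this job has already been restarted.
    max_restarts:  the cap "X"; 1 means restart once, then give up.
    """
    return result == "incomplete" and restart_count < max_restarts


# Simulate a job that keeps ending up incomplete.
restarts = 0
while should_restart("incomplete", restarts, max_restarts=2):
    restarts += 1
print(f"gave up after {restarts} automatic restarts")
```

With a cap in place, a job that stays incomplete remains visible as incomplete after the last attempt instead of being duplicated forever.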

Actions #9

Updated by okurz over 8 years ago

My personal opinion is that no error should be (silently) ignored, and the problem we have is that restarting an incomplete job goes unnoticed. My vision is to have something like the "retry on build failures" in Jenkins, which retries but logs this a bit more obviously. My current proposed approach is to detect "incompletes" from outside, e.g. based on your python-openqa_client: find all incompletes, gather why they failed, and retrigger those that failed for a known reason.
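
The approach could be sketched roughly as follows. To avoid guessing at any client API, this works on plain dictionaries; the job fields (`id`, `result`, `log`), the pattern table and the function names are all assumptions for illustration. Fetching the jobs and actually retriggering them is left out.

```python
import re

# Known failure causes, keyed by a short label. The single pattern here is
# taken from the worker log line quoted earlier in this ticket.
KNOWN_PATTERNS = {
    "upload_connection_closed": re.compile(
        r"Connection error: Premature connection close"
    ),
}


def classify(log_text):
    """Return the label of the first known pattern found in the log, or None."""
    for label, pattern in KNOWN_PATTERNS.items():
        if pattern.search(log_text):
            return label
    return None


def plan_retriggers(jobs):
    """Split incomplete jobs into those to retrigger and those needing review.

    jobs: iterable of dicts with hypothetical keys 'id', 'result' and 'log'.
    Returns (retrigger, needs_review): a list of (job id, reason) pairs and
    a list of job ids with no known cause.
    """
    retrigger, needs_review = [], []
    for job in jobs:
        if job["result"] != "incomplete":
            continue
        reason = classify(job["log"])
        if reason is None:
            needs_review.append(job["id"])
        else:
            retrigger.append((job["id"], reason))
    return retrigger, needs_review


# Made-up jobs for demonstration only.
jobs = [
    {"id": 1, "result": "incomplete",
     "log": "ERROR video.ogv: Connection error: Premature connection close"},
    {"id": 2, "result": "incomplete", "log": "something unexpected happened"},
    {"id": 3, "result": "passed", "log": ""},
]
retrigger, needs_review = plan_retriggers(jobs)
for job_id, reason in retrigger:
    print(f"retrigger job {job_id}: known cause '{reason}'")
for job_id in needs_review:
    print(f"job {job_id} is incomplete for an unknown reason, needs review")
```

Logging the reason alongside each retrigger keeps the retry visible, which is the point of the proposal: nothing is restarted silently, and unknown causes are left for a person to look at.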
