action #13042: Tests that run over MAX_JOB_TIME should fail - openQA Project (public) - openSUSE Project Management Tool

action #13042

closed

Tests that run over MAX_JOB_TIME should fail

Added by coolo over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: okurz
Category:
Target version:
Start date: 2016-08-06
Due date:
% Done:
Estimated time:

Description

Right now these jobs end up incomplete and are restarted forever. This makes it very hard to spot the problem.

History

Updated by okurz over 8 years ago

I guess the problem is again the question of who is responsible for INCOMPLETE jobs and who is responsible for FAILED jobs. For me it always comes back to whether we should really auto-restart incomplete jobs. As long as we don't have a good view of incomplete jobs that an admin can work through, I don't think this auto-restart is a good idea. It is also very annoying for local test development, where the cause is obvious (the test developer did something wrong). I don't agree in general that tests that run over the time limit should fail. I can only see two reasons for hitting MAX_JOB_TIME:

1. The lower-level timeouts of all testapi calls sum up to more than MAX_JOB_TIME --> the test developer should either lower the testapi call timeouts or adjust MAX_JOB_TIME
2. Something is hanging in test execution --> the test developer did something really weird or the backend has a bug

Both cases mean that the product is not to blame, so the test should not complete with FAILED.

Can we just make INCOMPLETE jobs not auto-start again? If you want some high-level auto-retrigger magic, we should do this only based on known patterns, as the Jenkins plugin for known failure causes does.
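
The first reason above can be checked ahead of time. The following is only a minimal sketch in Python, not openQA code: the function name `timeout_budget_exceeded` and the timeout values are made up for illustration, and the limit passed in stands for whatever MAX_JOB_TIME is configured for the job.

```python
# Hypothetical helper; nothing here is part of openQA or os-autoinst.
def timeout_budget_exceeded(call_timeouts, max_job_time):
    """Return True if the worst-case sum of call timeouts exceeds the job limit."""
    return sum(call_timeouts) > max_job_time


# Illustrative values in seconds; 7200 is only an example limit.
call_timeouts = [300, 1800, 3600, 2000]
print(timeout_budget_exceeded(call_timeouts, max_job_time=7200))
```

If such a check reports that the budget is exceeded, the test can hit the job time limit without anything hanging, which points at the timeouts or the limit rather than at the product.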

Updated by coolo over 8 years ago

We can try. I can't really judge the pros of restarting, but right now I know several cons ;(

Updated by coolo over 8 years ago

If the web API failed the worker, though, the worker is right to duplicate the job.

Updated by okurz over 8 years ago

Yes, that's certainly true. It means we can still improve our monitoring of error messages in the web server's log files, but the worker should retry if the failure is not its fault.

Updated by okurz over 8 years ago

Status changed from New to In Progress
Assignee set to okurz

https://github.com/os-autoinst/openQA/pull/828

Updated by okurz over 8 years ago

Status changed from In Progress to Resolved

PR merged and deployed

Updated by AdamWill over 8 years ago

For the record, I'm gonna revert this for Fedora, at least for now. I'd really like our tests to be comparable day-to-day as far as possible, and we still get some just 'random crap' incomplete jobs... just today I have one where a video upload failed for no readily discernible reason:

Sep 11 02:48:00 qa06.qa.fedoraproject.org worker[31095]: ERROR video.ogv: Connection error: Premature connection close

and we still get a few tests like this one:

https://openqa.stg.fedoraproject.org/tests/41821/file/autoinst-log.txt

where for some reason the worker process just fails to connect to the qemu VNC server. And I've had a couple of cases where asset upload hit a 403 for some bizarre reason. I don't have time to try and figure out all these bizarre things for now; auto-duplication does a decent job of making them not really a problem.

Updated by AdamWill over 8 years ago

For a compromise, it should probably be pretty easy to only restart 'mysteriously incomplete' jobs once (or X number of times), instead of doing it infinitely if they keep hitting that state.
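
A bounded restart policy of this kind could look roughly like the sketch below. It is written in Python purely for illustration and is not how the openQA worker is implemented; `should_restart`, `restart_count` and `max_restarts` are hypothetical names.

```python
# Hypothetical policy function; not taken from the openQA code base.
def should_restart(result, restart_count, max_restarts=1):
    """Allow an automatic restart only for incomplete jobs below the restart cap."""
    return result == "incomplete" and restart_count < max_restarts


print(should_restart("incomplete", restart_count=0))  # first incomplete: restart
print(should_restart("incomplete", restart_count=1))  # cap reached: leave it visible
```

With `max_restarts=1` a one-off glitch is retried once, while a job that keeps ending up incomplete stays in that state instead of looping forever.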

Updated by okurz over 8 years ago

My personal opinion is that no error should be (silently) ignored, and the problem we have is that restarting an incomplete job goes unnoticed. My vision is to have something like the "retry on build failures" feature in Jenkins, which retries but logs this a bit more obviously. My currently proposed approach is to detect incompletes from outside, e.g. based on your python-openqa_client: find all incompletes, gather why they failed and, if the reason is a known one, retrigger.
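
The triage step of that approach might be sketched as follows. This is an assumption-laden illustration, not an existing tool: fetching the incomplete jobs and their logs (for example via python-openqa_client) is left out, the job IDs are made up, and the only pattern listed is the upload error quoted earlier in this thread.

```python
import re

# Hypothetical list of known causes; only this one message appears in the thread.
KNOWN_PATTERNS = {
    "upload connection closed": re.compile(
        r"Connection error: Premature connection close"
    ),
}


def classify_incomplete(log_text):
    """Return the name of the first known cause found in the log, or None."""
    for name, pattern in KNOWN_PATTERNS.items():
        if pattern.search(log_text):
            return name
    return None


def triage_incompletes(logs_by_job):
    """Split jobs into (job_id, cause) pairs to retrigger and job IDs needing review."""
    retrigger, review = [], []
    for job_id, log_text in logs_by_job.items():
        cause = classify_incomplete(log_text)
        if cause is None:
            review.append(job_id)
        else:
            # Report the reason so that the retrigger does not go unnoticed.
            print(f"job {job_id}: retrigger, known cause: {cause}")
            retrigger.append((job_id, cause))
    return retrigger, review


# Made-up job IDs; the first log line is the one quoted above.
logs = {
    101: "ERROR video.ogv: Connection error: Premature connection close",
    102: "some failure nobody has classified yet",
}
retrigger, review = triage_incompletes(logs)
```

Jobs with an unknown cause end up in `review` and are not retriggered, so new kinds of incompletes stay visible to whoever watches the instance.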
