action #50225

closed

Make JOB_TIMEOUT incompletes more obvious

Added by leli over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2019-04-10
Due date:
% Done: 0%
Estimated time:

Description

In the migration test https://openqa.suse.de/tests/2784731, all modules passed, but the test was marked as incomplete.
In autoinst-log.txt we can see the debug message 'unable to inform websocket clients about stopping command server':
###################################
[2019-04-08T19:29:25.692 CEST] [debug] done with autotest process
[2019-04-08T19:29:25.692 CEST] [debug] killing command server 356459 because test execution ended
[2019-04-08T19:29:25.692 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast
[2019-04-08T19:29:40.707 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.

[2019-04-08T19:29:41.708 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.
#####################################
The job stalls there for so long that it runs past the 2-hour job timeout and is then marked as incomplete.

https://openqa.suse.de/tests/2784731/file/autoinst-log.txt
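
As context for the log excerpt above, here is a minimal sketch of the kind of 'inform websocket clients' request that is timing out, assuming a Mojo::UserAgent POST with a 15-second request timeout (15 s matches the gap between the two log lines; the URL, payload and error handling are illustrative, not the actual isotovideo code):

    use Mojo::UserAgent;

    # Illustrative sketch only, not the actual isotovideo code: tell the
    # command server's websocket clients that test execution has ended.
    my $ua  = Mojo::UserAgent->new(request_timeout => 15);    # give up after 15 s
    my $url = 'http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast';
    my $tx  = $ua->post($url => json => {stopping_test_execution => 1});
    if (my $err = $tx->error) {
        # corresponds to the "unable to inform websocket clients" line above;
        # after this the shutdown should continue instead of stalling
        warn "unable to inform websocket clients: $err->{message}";
    }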


Related issues: 2 (0 open, 2 closed)

Related to openQA Project (public) - action #49961: Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes (Rejected, okurz, 2019-04-03)

Related to openQA Project (public) - coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons (Resolved, okurz, 2020-04-01 to 2020-09-30)

Actions #1

Updated by leli over 5 years ago

Found the same issue on build 212.1. https://openqa.suse.de/tests/2795211

Actions #2

Updated by coolo over 5 years ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Subject changed from All test modules passed, but test marked as incomplete to Make JOB_TIMEOUT incompletes more obvious
  • Category set to 140 (Feature requests)
  • Target version set to Ready

Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.

But we should really make that incomplete more obvious - possibly by a different state even.
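
For reference, if one wanted to follow the 'extend that timeout' advice: to my understanding the relevant knob is the MAX_JOB_TIME job setting, which defaults to 2 hours. A hedged example (the exact value is arbitrary):

    # job / test suite setting, value in seconds; 7200 (2 h) is the assumed default
    MAX_JOB_TIME=10800

As the following comments point out, though, this only works around the underlying stall rather than fixing it.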

Actions #3

Updated by leli over 5 years ago

coolo wrote:

Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.

But we should really make that incomplete more obvious - possibly by a different state even.

In fact, I think my description already covers that: the job did eventually time out after more than 2 hours, but it stalled around 19:29:41 for far too long (the backend got TERM at 20:18:21.423) before hitting that timeout. All test modules had already passed before that, and the issue happens randomly. I think the backend needs to be analyzed first; extending the timeout as a workaround is not the correct direction.

Actions #4

Updated by mkittler over 5 years ago

  • Related to action #49961: Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes added
Actions #5

Updated by mkittler over 5 years ago

The job mentioned in the ticket description times out in the same way as the ones mentioned in https://progress.opensuse.org/issues/49961, so all comments I made under that ticket apply here, too.

So @coolo's tip 'Extend that timeout to fix it' wouldn't help here much since it is a command from the svirt backend which hangs at the very end.

Note that normally it is quite obvious that a job incompletes due to the timeout because there are test modules which haven't been executed and the execution time is almost exactly 2 hours. But having a different result (not state) would also make sense.

Actions #6

Updated by mkittler over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint
Actions #7

Updated by szarate over 5 years ago

But we should really make that incomplete more obvious - possibly by a different state even.

A new state would be great, but it might hide underlying problems if the answer is "Oh, yeah... sometimes it's incomplete, just restart." Maybe a record_info would be good enough?
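
For reference, a minimal sketch of the record_info call being suggested, as it would look inside a test module (the title and message are made up):

    use testapi;

    # Illustrative only: surface the timeout situation to the reviewer.
    # record_info(title, info text, result => 'ok'|'fail'|'softfail')
    record_info('JOB_TIMEOUT', 'job ran into the 2 h job timeout', result => 'softfail');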

Actions #8

Updated by mkittler over 5 years ago

PR: https://github.com/os-autoinst/openQA/pull/2064

I implemented this now as a different result. This is consistent with USER_CANCELLED, USER_RESTARTED and the other "special" incomplete results which fall into the same pattern.

Maybe a record_info would be good enough?

Good enough? That sounds like implementing this as a record_info would be easier. But I don't see how that would be the case. The job is aborted by the worker when the timeout is exceeded and so far adding test artifacts from the worker side is not implemented, right?
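
Conceptually, the change discussed here boils down to the worker reporting a dedicated incomplete result when its job timeout fires. A simplified sketch (names like TIMEOUT_EXCEEDED and stop() are placeholders modelled on this discussion, not the actual code from the PR):

    use constant {
        INCOMPLETE       => 'incomplete',
        TIMEOUT_EXCEEDED => 'timeout_exceeded',
    };

    sub on_job_timeout {
        my ($job) = @_;
        # previously the job simply ended up as INCOMPLETE here; a dedicated
        # result makes the reason visible in the web UI and in queries
        $job->stop(TIMEOUT_EXCEEDED);
    }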

Actions #9

Updated by szarate over 5 years ago

They are uploaded when the job's timeout is exceeded.

About the record_info, it's just that it would be the simpler solution. Having a new state definitely helps; however, while it eases the review process, it could often cause reviewers to simply dismiss the result and retrigger.

Actions #10

Updated by mkittler over 5 years ago

  • Status changed from In Progress to Resolved

The PR for introducing another result has been merged. This should make it obvious enough.

Actions #11

Updated by okurz over 4 years ago

  • Related to coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons added