action #50225

closed

Make JOB_TIMEOUT incompletes more obvious

Added by leli over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2019-04-10
Due date:
% Done: 0%
Estimated time:

Description

In the migration test https://openqa.suse.de/tests/2784731, all modules passed, but the test was marked as incomplete.
In autoinst-log.txt we can see the debug message 'unable to inform websocket clients about stopping command server':
###################################
[2019-04-08T19:29:25.692 CEST] [debug] done with autotest process
[2019-04-08T19:29:25.692 CEST] [debug] killing command server 356459 because test execution ended
[2019-04-08T19:29:25.692 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast
[2019-04-08T19:29:40.707 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.

[2019-04-08T19:29:41.708 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.
#####################################
The job stalls there for so long that it runs past the 2-hour job timeout and is then marked as incomplete.

https://openqa.suse.de/tests/2784731/file/autoinst-log.txt
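
As context for the log excerpt above, here is a minimal sketch of the kind of 'inform websocket clients' request that is timing out, assuming a Mojo::UserAgent POST with a 15-second request timeout (15 s matches the gap between the two log lines; the URL, payload and error handling are illustrative, not the actual isotovideo code):

    use Mojo::UserAgent;

    # Illustrative sketch only, not the actual isotovideo code: tell the
    # command server's websocket clients that test execution has ended.
    my $ua  = Mojo::UserAgent->new(request_timeout => 15);    # give up after 15 s
    my $url = 'http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast';
    my $tx  = $ua->post($url => json => {stopping_test_execution => 1});
    if (my $err = $tx->error) {
        # corresponds to the "unable to inform websocket clients" line above;
        # after this the shutdown should continue instead of stalling
        warn "unable to inform websocket clients: $err->{message}";
    }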


Related issues: 2 (0 open, 2 closed)

Related to openQA Project (public) - action #49961: Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes (Rejected, okurz, 2019-04-03)

Related to openQA Project (public) - coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons (Resolved, okurz, 2020-04-01 to 2020-09-30)

Actions #1

Updated by leli over 5 years ago

Found the same issue on build 212.1. https://openqa.suse.de/tests/2795211

Actions #2

Updated by coolo over 5 years ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Subject changed from All test modules passed, but test marked as incomplete to Make JOB_TIMEOUT incompletes more obvious
  • Category set to 140 (Feature requests)
  • Target version set to Ready

Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.

But we should really make that incomplete more obvious - possibly by a different state even.
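
For reference, if one wanted to follow the 'extend that timeout' advice: to my understanding the relevant knob is the MAX_JOB_TIME job setting, which defaults to 2 hours. A hedged example (the exact value is arbitrary):

    # job / test suite setting, value in seconds; 7200 (2 h) is the assumed default
    MAX_JOB_TIME=10800

As the following comments point out, though, this only works around the underlying stall rather than fixing it.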

Actions #3

Updated by leli over 5 years ago

coolo wrote:

Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.

But we should really make that incomplete more obvious - possibly by a different state even.

In fact, I think my description already covers that: the job did eventually time out after more than 2 hours, but it stalled around 19:29:41 for far too long (the backend got TERM at 20:18:21.423) before hitting that timeout. All test modules had already passed before that, and the issue happens randomly. I think the backend needs to be analyzed first; extending the timeout as a workaround is not the correct direction.

Actions #4

Updated by mkittler over 5 years ago

  • Related to action #49961: Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes added
Actions #5

Updated by mkittler over 5 years ago

The job mentioned in the ticket description times out in the same way as the ones mentioned in https://progress.opensuse.org/issues/49961, so all comments I made under that ticket apply here, too.

So @coolo's tip 'Extend that timeout to fix it' wouldn't help here much since it is a command from the svirt backend which hangs at the very end.

Note that normally it is quite obvious that a job incompletes due to the timeout because there are test modules which haven't been executed and the execution time is almost exactly 2 hours. But having a different result (not state) would also make sense.

Actions #6

Updated by mkittler over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint
Actions #7

Updated by szarate over 5 years ago

But we should really make that incomplete more obvious - possibly by a different state even.

A new state would be great, but it might hide underlying problems if the answer is "Oh, yeah... sometimes it's incomplete, just restart." Maybe a record_info would be good enough?
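
For reference, a minimal sketch of the record_info call being suggested, as it would look inside a test module (the title and message are made up):

    use testapi;

    # Illustrative only: surface the timeout situation to the reviewer.
    # record_info(title, info text, result => 'ok'|'fail'|'softfail')
    record_info('JOB_TIMEOUT', 'job ran into the 2 h job timeout', result => 'softfail');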

Actions #8

Updated by mkittler over 5 years ago

PR: https://github.com/os-autoinst/openQA/pull/2064

I implemented this now as a different result. This is consistent with USER_CANCELLED, USER_RESTARTED and the other "special" incomplete results which fall into the same pattern.

Maybe a record_info would be good enough?

Good enough? That sounds like implementing this as a record_info would be easier. But I don't see how that would be the case. The job is aborted by the worker when the timeout is exceeded and so far adding test artifacts from the worker side is not implemented, right?
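
Conceptually, the change discussed here boils down to the worker reporting a dedicated incomplete result when its job timeout fires. A simplified sketch (names like TIMEOUT_EXCEEDED and stop() are placeholders modelled on this discussion, not the actual code from the PR):

    use constant {
        INCOMPLETE       => 'incomplete',
        TIMEOUT_EXCEEDED => 'timeout_exceeded',
    };

    sub on_job_timeout {
        my ($job) = @_;
        # previously the job simply ended up as INCOMPLETE here; a dedicated
        # result makes the reason visible in the web UI and in queries
        $job->stop(TIMEOUT_EXCEEDED);
    }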

Actions #9

Updated by szarate over 5 years ago

They are uploaded when the job's timeout is exceeded.

About the record_info, it's just that it would be the simpler solution. Having a new state definitely helps; however, while it eases the review process, it could often cause reviewers to simply dismiss the result and retrigger.

Actions #10

Updated by mkittler over 5 years ago

  • Status changed from In Progress to Resolved

The PR for introducing another result has been merged. This should make it obvious enough.

Actions #11

Updated by okurz over 4 years ago

  • Related to coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons added