action #50225

Make JOB_TIMEOUT incompletes more obvious

Added by leli 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Feature requests
Target version: Current Sprint
Difficulty:
Duration:
Start date: 10/04/2019
Due date:
% Done: 0%

Description

In the migration test https://openqa.suse.de/tests/2784731, all modules passed, but the test was marked as incomplete.
In autoinst-log.txt we can see the debug message 'unable to inform websocket clients about stopping command server':
###################################
[2019-04-08T19:29:25.692 CEST] [debug] done with autotest process
[2019-04-08T19:29:25.692 CEST] [debug] killing command server 356459 because test execution ended
[2019-04-08T19:29:25.692 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast
[2019-04-08T19:29:40.707 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.

[2019-04-08T19:29:41.708 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.
#####################################
The job hangs at this point for so long that the test exceeds the 2-hour timeout and is then marked as incomplete.

https://openqa.suse.de/tests/2784731/file/autoinst-log.txt
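
A stall like this can be located by measuring the gap between consecutive timestamps in autoinst-log.txt. The following is a minimal Python sketch, not part of os-autoinst. The names `parse_timestamp` and `largest_gap` are hypothetical, and the sketch assumes that every relevant line starts with a `[YYYY-MM-DDTHH:MM:SS.mmm TZ]` prefix as in the excerpt above and that all lines share the same time zone.

```python
import re
from datetime import datetime

# Matches the leading "[2019-04-08T19:29:25.692 CEST]" prefix of a log line.
# The time zone abbreviation is matched but ignored (assumed constant).
TIMESTAMP_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}) [A-Z]+\]"
)


def parse_timestamp(line):
    """Return the timestamp of a log line, or None if it has no prefix."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap_in_seconds, line_after_gap) for the longest pause."""
    best_gap, best_line = 0.0, None
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # skip blank lines and lines without a timestamp
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > best_gap:
                best_gap, best_line = gap, line
        previous = current
    return best_gap, best_line


# Log lines copied from the excerpt in this ticket.
EXCERPT = [
    "[2019-04-08T19:29:25.692 CEST] [debug] done with autotest process",
    "[2019-04-08T19:29:25.692 CEST] [debug] killing command server 356459 because test execution ended",
    "[2019-04-08T19:29:25.692 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast",
    "[2019-04-08T19:29:40.707 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.",
    "",
    "[2019-04-08T19:29:41.708 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.",
]

gap, line = largest_gap(EXCERPT)
print(f"{gap:.3f}s gap before: {line}")
```

On the excerpt alone this reports the roughly 15-second request timeout before the 'unable to inform websocket clients' message. Run against the full log, the same approach should surface any longer pause that follows.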


Related issues

Related to openQA Project - action #49961: [tools][functional][u] Prevent svirt backend to hang on v... Blocked 03/04/2019

History

#1 Updated by leli 11 months ago

Found the same issue on build 212.1. https://openqa.suse.de/tests/2795211

#2 Updated by coolo 11 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from All test modules passed, but test marked as incomplete to Make JOB_TIMEOUT incompletes more obvious
  • Category set to 140
  • Target version set to Ready

Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.

But we should really make that incomplete more obvious - possibly by a different state even.

#3 Updated by leli 11 months ago

coolo wrote:

Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.


But we should really make that incomplete more obvious - possibly by a different state even.

In fact, I think my description already covers this: the job did eventually exceed the 2-hour timeout, but only because it stalled for too long from around 19:29:41 onwards (the backend got TERM at 20:18:21.423). All test modules had already passed before that, and the issue occurs randomly. I think the backend needs to be analyzed first; extending the timeout as a workaround is not the right direction.

#4 Updated by mkittler 10 months ago

  • Related to action #49961: [tools][functional][u] Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes added

#5 Updated by mkittler 10 months ago

The job mentioned in the ticket description times out in the same way as the ones mentioned in https://progress.opensuse.org/issues/49961. So all comments I made under that ticket apply here, too.

So @coolo's tip 'Extend that timeout to fix it' wouldn't help much here, since it is a command from the svirt backend that hangs at the very end.

Note that normally it is quite obvious that a job incompletes due to the timeout because there are test modules which haven't been executed and the execution time is almost exactly 2 hours. But having a different result (not state) would also make sense.
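
The heuristic described above can be written down as a small check. This is a sketch only: `looks_like_job_timeout`, its parameters, the 60-second tolerance, and the use of "none" for modules that never ran are all hypothetical and not taken from openQA. The 2-hour default mirrors the timeout mentioned in this ticket.

```python
JOB_TIMEOUT_S = 2 * 60 * 60  # 2-hour job timeout mentioned in this ticket


def looks_like_job_timeout(result, duration_s, module_results,
                           timeout_s=JOB_TIMEOUT_S, tolerance_s=60):
    """Guess whether an incomplete job was aborted by the job timeout.

    module_results maps module name -> result string; "none" stands for
    a module that was never executed (an assumption of this sketch).
    """
    if result != "incomplete":
        return False
    near_timeout = abs(duration_s - timeout_s) <= tolerance_s
    has_unexecuted = any(r == "none" for r in module_results.values())
    return near_timeout and has_unexecuted


# Usual case: the job ran out of time while modules were still pending.
typical = looks_like_job_timeout(
    "incomplete", 7201, {"boot": "passed", "install": "none"}
)

# Case from this ticket: every module passed, the hang came afterwards.
ticket_like = looks_like_job_timeout(
    "incomplete", 7200, {"boot": "passed", "install": "passed"}
)

print(typical, ticket_like)
```

The second call returns False even though the job did hit the timeout, which illustrates why a heuristic based on unexecuted modules is not enough for jobs that hang at the very end.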

#6 Updated by mkittler 10 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

#7 Updated by szarate 10 months ago

But we should really make that incomplete more obvious - possibly by a different state even.

A new state would be great, but it might hide underlying problems if the answer is "Oh, yeah... sometimes it's incomplete, just restart." Maybe a record_info would be good enough?

#8 Updated by mkittler 10 months ago

PR: https://github.com/os-autoinst/openQA/pull/2064

I implemented this now as a different result. This is consistent with USER_CANCELLED, USER_RESTARTED and the other "special" incomplete results which fall into the same pattern.
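
As a rough illustration of that pattern, the sketch below models job results where several distinct values all count as "incomplete-like". It is Python rather than openQA's actual code, and the names `JobResult`, `INCOMPLETE_LIKE`, `is_incomplete_like`, and in particular `TIMEOUT_EXCEEDED` are assumptions made for illustration; this ticket does not state what the new result is called in the PR.

```python
from enum import Enum


class JobResult(Enum):
    PASSED = "passed"
    FAILED = "failed"
    INCOMPLETE = "incomplete"
    USER_CANCELLED = "user_cancelled"
    USER_RESTARTED = "user_restarted"
    # Hypothetical name for the new result discussed in this ticket.
    TIMEOUT_EXCEEDED = "timeout_exceeded"


# Results that mean "the job did not run to completion", each with its
# own value so that the reason stays visible to reviewers.
INCOMPLETE_LIKE = frozenset({
    JobResult.INCOMPLETE,
    JobResult.USER_CANCELLED,
    JobResult.USER_RESTARTED,
    JobResult.TIMEOUT_EXCEEDED,
})


def is_incomplete_like(result):
    """Return True if the result belongs to the incomplete family."""
    return result in INCOMPLETE_LIKE


for result in JobResult:
    print(result.value, is_incomplete_like(result))
```

Keeping the timeout as its own member of the family means existing checks for "did not complete" keep working, while the specific reason can still be shown separately.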

Maybe a record_info would be good enough?

Good enough? That sounds like implementing this as a record_info would be easier. But I don't see how that would be the case. The job is aborted by the worker when the timeout is exceeded and so far adding test artifacts from the worker side is not implemented, right?

#9 Updated by szarate 10 months ago

They are uploaded when the job's timeout is exceeded.

As for the record_info, it is just a matter of it being the simpler solution. Having a new state definitely helps, but while it eases the review process, it could often lead reviewers to simply dismiss the result and retrigger the job.

#10 Updated by mkittler 10 months ago

  • Status changed from In Progress to Resolved

The PR for introducing another result has been merged. This should make it obvious enough.
