action #50225
closed
Make JOB_TIMEOUT incompletes more obvious
Description
In migration test https://openqa.suse.de/tests/2784731, all modules passed, but the test was marked as incomplete.
In autoinst-log.txt we can see the debug message 'unable to inform websocket clients about stopping command server':
###################################
[2019-04-08T19:29:25.692 CEST] [debug] done with autotest process
[2019-04-08T19:29:25.692 CEST] [debug] killing command server 356459 because test execution ended
[2019-04-08T19:29:25.692 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/TzCqocItQD2XDNhg/broadcast
[2019-04-08T19:29:40.707 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 171.
[2019-04-08T19:29:41.708 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.
#####################################
The job stalls there for a long time, eventually exceeds the two-hour job timeout, and is then marked as incomplete.
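The log above shows isotovideo blocking on the HTTP broadcast to websocket clients until the request times out. As a minimal illustrative sketch (in Python, not the actual Perl isotovideo code; the function name and timeout value are assumptions), the notification can be bounded so that a failure there never stalls teardown:

```python
import urllib.request
import urllib.error


def notify_clients(broadcast_url, timeout=15):
    """Try to inform websocket clients that the command server is stopping.

    Returns True on success, False if the request failed or timed out.
    Either way the caller should proceed with shutting down the command
    server instead of stalling, which is what the log above shows going wrong.
    """
    try:
        req = urllib.request.Request(broadcast_url, data=b"{}", method="POST")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False
```

With an unreachable endpoint the call returns False within the timeout rather than hanging indefinitely.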
Updated by leli over 5 years ago
Found the same issue on build 212.1. https://openqa.suse.de/tests/2795211
Updated by coolo over 5 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Subject changed from All test modules passed, but test marked as incomplete to Make JOB_TIMEOUT incompletes more obvious
- Category set to 140
- Target version set to Ready
Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.
But we should really make that incomplete more obvious - possibly by a different state even.
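For reference, "extend that timeout" means raising the per-job limit via the MAX_JOB_TIME test setting, which is given in seconds (the two-hour default corresponds to 7200). A hypothetical test-suite setting raising it to three hours:

```
MAX_JOB_TIME=10800
```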
Updated by leli over 5 years ago
coolo wrote:
Because you ran into the job timeout - after 2 hours the job turns into incomplete. Extend that timeout to fix it - and stop filing bugs under infrastructure unless you are sure it's about infrastructure.
But we should really make that incomplete more obvious - possibly by a different state even.
In fact, I think my description already covers this: the job eventually exceeded the two-hour timeout, but only because it stalled around 19:29:41 for too long (the backend got TERM at 20:18:21.423). All test modules had already passed before that, and the issue happens randomly. I think the backend behaviour needs to be analyzed first; extending the timeout as a workaround is not the right direction.
Updated by mkittler over 5 years ago
- Related to action #49961: Prevent svirt backend to hang on virsh undefine command causing job timeouts/incompletes added
Updated by mkittler over 5 years ago
The job mentioned in the ticket description times out in the same way as the ones mentioned in https://progress.opensuse.org/issues/49961, so all comments I made under that ticket apply here, too.
So @coolo's tip 'Extend that timeout to fix it' wouldn't help much here, since it is a command from the svirt backend which hangs at the very end.
Note that normally it is quite obvious that a job becomes incomplete due to the timeout, because there are test modules which haven't been executed and the execution time is almost exactly 2 hours. But having a different result (not state) would also make sense.
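The heuristic described above can be sketched as follows (an illustrative Python sketch; the function, the module representation, and the 60-second tolerance are assumptions, not openQA code):

```python
MAX_JOB_TIME = 7200  # default job timeout in seconds (2 hours)


def looks_like_timeout_incomplete(runtime_seconds, modules):
    """Heuristic from the comment above: a job that incompletes because of
    the job timeout typically has unexecuted modules and a total runtime
    very close to the configured maximum."""
    has_unexecuted = any(m.get("result") == "none" for m in modules)
    near_limit = abs(runtime_seconds - MAX_JOB_TIME) < 60
    return has_unexecuted and near_limit
```

The case in this ticket is the exception: all modules passed, so only the runtime hints at the timeout, which is why a dedicated result is more obvious.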
Updated by mkittler over 5 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
- Target version changed from Ready to Current Sprint
Updated by szarate over 5 years ago
But we should really make that incomplete more obvious - possibly by a different state even.
A new state would be great, but it might hide underlying problems if the answer is "Oh, yeah... sometimes it's incomplete, just restart." Maybe a record_info would be good enough?
Updated by mkittler over 5 years ago
PR: https://github.com/os-autoinst/openQA/pull/2064
I implemented this now as a different result. This is consistent with USER_CANCELLED, USER_RESTARTED and the other "special" incomplete results which fall into the same pattern.
Maybe a record_info would be good enough?
Good enough? That sounds like implementing this as a record_info would be easier, but I don't see how that would be the case. The job is aborted by the worker when the timeout is exceeded, and so far adding test artifacts from the worker side is not implemented, right?
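The pattern described in this comment, mapping the abort reason to a distinct job result alongside the existing "special" incomplete results, can be sketched like this (names are illustrative, not openQA's actual API):

```python
# Hypothetical result constants, mirroring the pattern of the existing
# "special" incomplete results mentioned above.
TIMEOUT_EXCEEDED = "timeout_exceeded"
USER_CANCELLED = "user_cancelled"
USER_RESTARTED = "user_restarted"
INCOMPLETE = "incomplete"

# Map the reason a worker aborted a job to the result reported for it.
ABORT_REASON_TO_RESULT = {
    "timeout": TIMEOUT_EXCEEDED,
    "cancel": USER_CANCELLED,
    "restart": USER_RESTARTED,
}


def result_for_abort(reason):
    # Unknown abort reasons fall back to the generic incomplete result.
    return ABORT_REASON_TO_RESULT.get(reason, INCOMPLETE)
```

A reviewer then sees "timeout_exceeded" directly instead of having to infer the cause from the runtime.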
Updated by szarate over 5 years ago
They are uploaded when the job's timeout is exceeded.
About the record_info: it's just a matter of it being the simpler solution. A new result definitely helps, but while it eases the review process, it could also cause reviewers to simply dismiss the result and retrigger.
Updated by mkittler over 5 years ago
- Status changed from In Progress to Resolved
The PR for introducing another result has been merged. This should make it obvious enough.
Updated by okurz over 4 years ago
- Related to coordination #65118: [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons added