action #63718
closedcoordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
incomplete reason with just "quit"/"died" could provide more information
Description
Motivation¶
We know have "reason" propagated from incomplete jobs but the reason "quit" with no further explanation doesn't tell much. E.g. https://openqa.opensuse.org/tests/1181212#
Acceptance¶
- AC1: DONE: Users can understand the consequence of an incomplete job with reason "quit"
- AC2: like AC1 but for reason "died"
Suggestions¶
- Try to simply extend the string that is provided from the worker
- If there are multiple locations from where we set "quit" we should be able to distinguish these cases from the reason string
Updated by okurz over 4 years ago
Also see https://github.com/os-autoinst/scripts/pull/20 about that for tracking
Updated by okurz over 4 years ago
- Subject changed from incomplete reason with just "quit" could provide more information to incomplete reason with just "quit"/"died" could provide more information
Updated by okurz over 4 years ago
- Related to action #61967: jobs incomplete with auto_review:"qemu-system-ppc64: Failed to allocate KVM HPT of order.*: Cannot allocate memory" added
Updated by okurz over 4 years ago
- Priority changed from Normal to High
- Target version set to Current Sprint
As this is one of the most prominent issues recorded as reason for incompletes right now we should handle it with accordingly higher priority, e.g. see https://gitlab.suse.de/openqa/auto-review/-/jobs/175065
Updated by mkittler over 4 years ago
The reason "quit" is admittedly not very self-explanatory. However, it has actually only one meaning (the worker received SIGTERM). Although there are no other "quit" reasons we could make it e.g. "quit: associated worker openqaworker-arm-2:17 received SIGTERM". That would be consistent with e.g. "abandoned: associated worker openqaworker-arm-2:17 re-connected but abandoned the job".
Died is a little bit harder. In these cases most likely a problem within isotovideo (e.g. the backend) caused the job to incomplete. Results (up to that point) and logs are usually available. These would provide further information but are usually completely cluttered. So it would be indeed nice to to have the most relevant line from the log added as reason, e.g. died: Error connecting to VNC server <10.161.145.13:5901>: IO::Socket::INET: connect: Connection timed out
. Not sure how to achieve that in a generic way, though. Sometimes we might not be able to pin down the problem to a single line anyways or it would just get too long. I suppose it would help if isotovideo (or the subprocess in question) would store its "final fatal error message" at a dedicated place. Of course this fatal error message could as well just be caused by a symptom and the actual problem happened before.
Updated by okurz over 4 years ago
mkittler wrote:
The reason "quit" is admittedly not very self-explanatory. However, it has actually only one meaning (the worker received SIGTERM). Although there are no other "quit" reasons we could make it e.g. "quit: associated worker openqaworker-arm-2:17 received SIGTERM". That would be consistent with e.g. "abandoned: associated worker openqaworker-arm-2:17 re-connected but abandoned the job".
Hm, not sure if "associated worker … received SIGTERM" is really that helpful from user perspective. As I user reading this message I would either not understand it or ask "ok, so what?". Rough idea, how about: "quit: associated worker openqaworker-arm-2:17 shut down, job was cloned" ?
Died is a little bit harder. In these cases most likely a problem within isotovideo (e.g. the backend) caused the job to incomplete. Results (up to that point) and logs are usually available. […] Sometimes we might not be able to pin down the problem to a single line anyways or it would just get too long.
Yes, how about just: "died: See log output for details". For this I created a simple https://github.com/os-autoinst/openQA/pull/2854
For "quit" I did not easily understand where I could add more information to the string.
Updated by mkittler over 4 years ago
Rough idea, how about: "quit: associated worker openqaworker-arm-2:17 shut down, job was cloned" ?
I don't like the idea.
- "job was cloned" is redundant (there's the
clone_id
in the DB and in the web UI) and maybe not even correct (at the point the reason is recorded we might not whether we will successfully clone that job). - "shut down" is more vague. It doesn't tell us the reason for the shut down, e.g. one might ask whether the worker crashed or why it decided to shut down. If "SIGTERM" is too technical what about "[...] has been stopped or restarted"? The passive makes at least clear that the worker merely reacted but did not initiate it.
Yes, how about just: "died: See log output for details".
I suppose people are used to look into the logs anyways but I like it for consistency.
For "quit" I did not easily understand where I could add more information to the string.
Here's a draft PR: https://github.com/os-autoinst/openQA/pull/2855
Updated by livdywan over 4 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by okurz over 4 years ago
https://github.com/os-autoinst/openQA/pull/2854 merged, https://github.com/os-autoinst/openQA/pull/2855 failed on scalability test failure. Either needs another fix or a rebase.
Updated by okurz over 4 years ago
- Status changed from In Progress to Resolved
https://openqa.suse.de/tests/4034740 shows "died: terminated prematurely, see log output for details". That should be good enough.
Updated by okurz about 4 years ago
- Status changed from Resolved to Workable
- Target version changed from Current Sprint to Ready
actually not good enough and we still track jobs with this ticket, e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/236945 shows many. SQL query
select jobs.id,t_finished,test,reason,result from jobs,comments where result = 'incomplete' and t_finished >= '2020-07-20' and comments.job_id = jobs.id and text ~ 'poo#63718';
reveals jobs like https://openqa.opensuse.org/tests/1342075# that we already labeled with #69343 but the reason is just "died" here.
Updated by mkittler about 4 years ago
- Status changed from Workable to In Progress
PR to improve some cases: https://github.com/os-autoinst/os-autoinst/pull/1493
I've been using the following SQL query to identify some cases:
select jobs.id,t_started,t_finished,test,reason,result,concat('https://openqa.suse.de/tests/', jobs.id) as url from jobs where result = 'incomplete' and reason = 'died' and t_finished >= '2020-07-20' order by t_finished desc;
With the PR the reason will still just be died
if the command server, the backend or autotest crash badly or get killed.
Updated by mkittler about 4 years ago
- Status changed from In Progress to Feedback
The PR has been merged. Let's see what cases are left which it is deployed for a while.
Updated by mkittler about 4 years ago
That's the list of jobs incompleted with just died
we've still got on OSD since the 1st August:
select jobs.id,t_started,t_finished,test,reason,result,concat('https://openqa.suse.de/tests/', jobs.id) as url from jobs where result = 'incomplete' and reason = 'died' and t_finished >= '2020-08-01' order by t_finished desc;
Here the most interesting cases:
# Test died: script timeout: zypper lifecycle at /usr/lib/os-autoinst/testapi.pm line 1104.
: We should likely treat a timeout withinscript_output
as a test failure. So the job should not be incomplete at all.Could not find 'validate_multipath' in complete history at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 85.
: Errors from the Git helpers should likely also be serialized withbmwqemu::serialize_state
.needles_dir not found: /var/lib/openqa/pool/24/os-autoinst-distri-opensuse/os-autoinst-distri-opensuse/products/sle/needles (check vars.json?) at /usr/lib/os-autoinst/needle.pm line 330.
: same as 2.
Other failures look identical or there's no autoinst-log.txt anymore so it is hard to tell what the problem was.
Maybe we should also add a general die-handler to isotovideo (and not just the backend).
Updated by mkittler about 4 years ago
- Status changed from Feedback to In Progress
PR for 2. and 3.: https://github.com/os-autoinst/os-autoinst/pull/1515
Updated by mkittler about 4 years ago
- Status changed from In Progress to Feedback
The PR for 2. and 3. has been merged. I'm not sure about 1. yet (got no decisive feedback on #testing).
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from Feedback to Workable
IMHO to finish this ticket we should just use the string "died: terminated prematurely, see log output for details" to give a little hint to users, nothing else :)
Updated by mkittler about 4 years ago
Updated by livdywan about 4 years ago
- Status changed from Workable to In Progress
Updated by mkittler about 4 years ago
- Status changed from In Progress to Resolved
I wanted to check whether it works on o3. There are actually already 25 jobs with the generic reason which seems to be quite a lot. To be sure the PR didn't break anything I've just did a few local test runs. The PR with the generic reason does not prevent the more specific reasons like "backend died: …" and "tests died: …" to work.
Because I was still curious about these 25 jobs I had a closer look. The jobs I've found on o3 are actually all older than the PR. Apparently we had a generic message at some point before the specific messages were implemented.
I guess I can nevertheless close the issue because we're monitoring incomplete jobs anyways and if there's still a job with just "died" we can simply reopen.
Updated by okurz almost 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: kde
https://openqa.opensuse.org/tests/1484527
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by tinita over 1 year ago
- Related to action #124934: os-autoinst: Misleading error message in the openQA info panel reason size:M added