action #63718: incomplete reason with just "quit"/"died" could provide more information - openQA Project (public) - openSUSE Project Management Tool

Actions

action #63718

closed

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

coordination #62420: [epic] Distinguish all types of incompletes

incomplete reason with just "quit"/"died" could provide more information

Added by okurz about 5 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Feature requests

Target version:

Ready

Start date:

2020-02-21

Due date:

% Done:

Estimated time:

Description

Motivation¶

We know have "reason" propagated from incomplete jobs but the reason "quit" with no further explanation doesn't tell much. E.g. https://openqa.opensuse.org/tests/1181212#

Acceptance¶

AC1: DONE: Users can understand the consequence of an incomplete job with reason "quit"
AC2: like AC1 but for reason "died"

Suggestions¶

Try to simply extend the string that is provided from the worker
If there are multiple locations from where we set "quit" we should be able to distinguish these cases from the reason string

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by okurz about 5 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz about 5 years ago

Also see https://github.com/os-autoinst/scripts/pull/20 about that for tracking

Actions

Copy link

Updated by okurz about 5 years ago

Subject changed from incomplete reason with just "quit" could provide more information to incomplete reason with just "quit"/"died" could provide more information

Actions

Copy link

Updated by okurz about 5 years ago

Related to action #61967: jobs incomplete with auto_review:"qemu-system-ppc64: Failed to allocate KVM HPT of order.*: Cannot allocate memory" added

Actions

Copy link

Updated by okurz about 5 years ago

Priority changed from Normal to High
Target version set to Current Sprint

As this is one of the most prominent issues recorded as reason for incompletes right now we should handle it with accordingly higher priority, e.g. see https://gitlab.suse.de/openqa/auto-review/-/jobs/175065

Actions

Copy link

Updated by mkittler about 5 years ago

The reason "quit" is admittedly not very self-explanatory. However, it has actually only one meaning (the worker received SIGTERM). Although there are no other "quit" reasons we could make it e.g. "quit: associated worker openqaworker-arm-2:17 received SIGTERM". That would be consistent with e.g. "abandoned: associated worker openqaworker-arm-2:17 re-connected but abandoned the job".

Died is a little bit harder. In these cases most likely a problem within isotovideo (e.g. the backend) caused the job to incomplete. Results (up to that point) and logs are usually available. These would provide further information but are usually completely cluttered. So it would be indeed nice to to have the most relevant line from the log added as reason, e.g. died: Error connecting to VNC server <10.161.145.13:5901>: IO::Socket::INET: connect: Connection timed out. Not sure how to achieve that in a generic way, though. Sometimes we might not be able to pin down the problem to a single line anyways or it would just get too long. I suppose it would help if isotovideo (or the subprocess in question) would store its "final fatal error message" at a dedicated place. Of course this fatal error message could as well just be caused by a symptom and the actual problem happened before.

Actions

Copy link

Updated by okurz about 5 years ago

mkittler wrote:

The reason "quit" is admittedly not very self-explanatory. However, it has actually only one meaning (the worker received SIGTERM). Although there are no other "quit" reasons we could make it e.g. "quit: associated worker openqaworker-arm-2:17 received SIGTERM". That would be consistent with e.g. "abandoned: associated worker openqaworker-arm-2:17 re-connected but abandoned the job".

Hm, not sure if "associated worker … received SIGTERM" is really that helpful from user perspective. As I user reading this message I would either not understand it or ask "ok, so what?". Rough idea, how about: "quit: associated worker openqaworker-arm-2:17 shut down, job was cloned" ?

Died is a little bit harder. In these cases most likely a problem within isotovideo (e.g. the backend) caused the job to incomplete. Results (up to that point) and logs are usually available. […] Sometimes we might not be able to pin down the problem to a single line anyways or it would just get too long.

Yes, how about just: "died: See log output for details". For this I created a simple https://github.com/os-autoinst/openQA/pull/2854

For "quit" I did not easily understand where I could add more information to the string.

Actions

Copy link

Updated by mkittler about 5 years ago

Rough idea, how about: "quit: associated worker openqaworker-arm-2:17 shut down, job was cloned" ?

I don't like the idea.

"job was cloned" is redundant (there's the clone_id in the DB and in the web UI) and maybe not even correct (at the point the reason is recorded we might not whether we will successfully clone that job).
"shut down" is more vague. It doesn't tell us the reason for the shut down, e.g. one might ask whether the worker crashed or why it decided to shut down. If "SIGTERM" is too technical what about "[...] has been stopped or restarted"? The passive makes at least clear that the worker merely reacted but did not initiate it.

Yes, how about just: "died: See log output for details".

I suppose people are used to look into the logs anyways but I like it for consistency.

For "quit" I did not easily understand where I could add more information to the string.

Here's a draft PR: https://github.com/os-autoinst/openQA/pull/2855

Actions

Copy link

Updated by livdywan about 5 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

Actions

Copy link

#10

Updated by okurz about 5 years ago

https://github.com/os-autoinst/openQA/pull/2854 merged, https://github.com/os-autoinst/openQA/pull/2855 failed on scalability test failure. Either needs another fix or a rebase.

Actions

Copy link

#11

Updated by okurz about 5 years ago

Status changed from In Progress to Resolved

https://openqa.suse.de/tests/4034740 shows "died: terminated prematurely, see log output for details". That should be good enough.

Actions

Copy link

#12

Updated by okurz over 4 years ago

Status changed from Resolved to Workable
Target version changed from Current Sprint to Ready

actually not good enough and we still track jobs with this ticket, e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/236945 shows many. SQL query

select jobs.id,t_finished,test,reason,result from jobs,comments where result = 'incomplete' and t_finished >= '2020-07-20' and comments.job_id = jobs.id and text ~ 'poo#63718';

reveals jobs like https://openqa.opensuse.org/tests/1342075# that we already labeled with #69343 but the reason is just "died" here.

Actions

Copy link

#13

Updated by mkittler over 4 years ago

Status changed from Workable to In Progress

PR to improve some cases: https://github.com/os-autoinst/os-autoinst/pull/1493

I've been using the following SQL query to identify some cases:

select jobs.id,t_started,t_finished,test,reason,result,concat('https://openqa.suse.de/tests/', jobs.id) as url from jobs where result = 'incomplete' and reason = 'died' and t_finished >= '2020-07-20' order by t_finished desc;

With the PR the reason will still just be died if the command server, the backend or autotest crash badly or get killed.

Actions

Copy link

#14

Updated by mkittler over 4 years ago

Status changed from In Progress to Feedback

The PR has been merged. Let's see what cases are left which it is deployed for a while.

Actions

Copy link

#15

Updated by mkittler over 4 years ago

That's the list of jobs incompleted with just died we've still got on OSD since the 1st August:

select jobs.id,t_started,t_finished,test,reason,result,concat('https://openqa.suse.de/tests/', jobs.id) as url from jobs where result = 'incomplete' and reason = 'died' and t_finished >= '2020-08-01' order by t_finished desc;

Here the most interesting cases:

# Test died: script timeout: zypper lifecycle at /usr/lib/os-autoinst/testapi.pm line 1104.: We should likely treat a timeout within script_output as a test failure. So the job should not be incomplete at all.
Could not find 'validate_multipath' in complete history at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 85.: Errors from the Git helpers should likely also be serialized with bmwqemu::serialize_state.
needles_dir not found: /var/lib/openqa/pool/24/os-autoinst-distri-opensuse/os-autoinst-distri-opensuse/products/sle/needles (check vars.json?) at /usr/lib/os-autoinst/needle.pm line 330.: same as 2.

Other failures look identical or there's no autoinst-log.txt anymore so it is hard to tell what the problem was.

Maybe we should also add a general die-handler to isotovideo (and not just the backend).

Actions

Copy link

#16

Updated by mkittler over 4 years ago

Status changed from Feedback to In Progress

PR for 2. and 3.: https://github.com/os-autoinst/os-autoinst/pull/1515

Actions

Copy link

#17

Updated by mkittler over 4 years ago

Status changed from In Progress to Feedback

The PR for 2. and 3. has been merged. I'm not sure about 1. yet (got no decisive feedback on #testing).

Actions

Copy link

#18

Updated by okurz over 4 years ago

Description updated (diff)
Status changed from Feedback to Workable

IMHO to finish this ticket we should just use the string "died: terminated prematurely, see log output for details" to give a little hint to users, nothing else :)

Actions

Copy link

#19

Updated by mkittler over 4 years ago

PR: https://github.com/os-autoinst/openQA/pull/3359

Actions

Copy link

#20

Updated by livdywan over 4 years ago

Status changed from Workable to In Progress

Actions

Copy link

#21

Updated by mkittler over 4 years ago

Status changed from In Progress to Resolved

I wanted to check whether it works on o3. There are actually already 25 jobs with the generic reason which seems to be quite a lot. To be sure the PR didn't break anything I've just did a few local test runs. The PR with the generic reason does not prevent the more specific reasons like "backend died: …" and "tests died: …" to work.

Because I was still curious about these 25 jobs I had a closer look. The jobs I've found on o3 are actually all older than the PR. Apparently we had a generic message at some point before the specific messages were implemented.

I guess I can nevertheless close the issue because we're monitoring incomplete jobs anyways and if there's still a job with just "died" we can simply reopen.

Actions

Copy link

#22

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: kde
https://openqa.opensuse.org/tests/1484527

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released"
The label in the openQA scenario is removed

Actions

Copy link

#23

Updated by tinita about 2 years ago

Related to action #124934: os-autoinst: Misleading error message in the openQA info panel reason size:M added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #63718

incomplete reason with just "quit"/"died" could provide more information

Motivation¶

Acceptance¶

Suggestions¶

Updated by okurz about 5 years ago

Updated by okurz about 5 years ago

Updated by okurz about 5 years ago

Updated by okurz about 5 years ago

Updated by okurz about 5 years ago

Updated by mkittler about 5 years ago

Updated by okurz about 5 years ago

Updated by mkittler about 5 years ago

Updated by livdywan about 5 years ago

Updated by okurz about 5 years ago

Updated by okurz about 5 years ago

Updated by okurz over 4 years ago

Updated by mkittler over 4 years ago

Updated by mkittler over 4 years ago

Updated by mkittler over 4 years ago

Updated by mkittler over 4 years ago

Updated by mkittler over 4 years ago

Updated by okurz over 4 years ago

Updated by mkittler over 4 years ago

Updated by livdywan over 4 years ago

Updated by mkittler over 4 years ago

Updated by okurz over 4 years ago

Updated by tinita about 2 years ago