Project

General

Profile

action #63718

coordination #39719: [saga][epic] Detect "known failures" and mark jobs as such to make tests more stable, reviewing test results and tracking known issues easier

coordination #62420: [epic] Distinguish all types of incompletes

incomplete reason with just "quit"/"died" could provide more information

Added by okurz 11 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-02-21
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

We know have "reason" propagated from incomplete jobs but the reason "quit" with no further explanation doesn't tell much. E.g. https://openqa.opensuse.org/tests/1181212#

Acceptance

  • AC1: DONE: Users can understand the consequence of an incomplete job with reason "quit"
  • AC2: like AC1 but for reason "died"

Suggestions

  • Try to simply extend the string that is provided from the worker
  • If there are multiple locations from where we set "quit" we should be able to distinguish these cases from the reason string

Related issues

Related to openQA Project - action #61967: jobs incomplete with auto_review:"qemu-system-ppc64: Failed to allocate KVM HPT of order.*: Cannot allocate memory"Resolved2020-01-09

History

#1 Updated by okurz 11 months ago

  • Description updated (diff)

#2 Updated by okurz 11 months ago

Also see https://github.com/os-autoinst/scripts/pull/20 about that for tracking

#3 Updated by okurz 11 months ago

  • Subject changed from incomplete reason with just "quit" could provide more information to incomplete reason with just "quit"/"died" could provide more information

#4 Updated by okurz 11 months ago

  • Related to action #61967: jobs incomplete with auto_review:"qemu-system-ppc64: Failed to allocate KVM HPT of order.*: Cannot allocate memory" added

#5 Updated by okurz 11 months ago

  • Priority changed from Normal to High
  • Target version set to Current Sprint

As this is one of the most prominent issues recorded as reason for incompletes right now we should handle it with accordingly higher priority, e.g. see https://gitlab.suse.de/openqa/auto-review/-/jobs/175065

#6 Updated by mkittler 10 months ago

The reason "quit" is admittedly not very self-explanatory. However, it has actually only one meaning (the worker received SIGTERM). Although there are no other "quit" reasons we could make it e.g. "quit: associated worker openqaworker-arm-2:17 received SIGTERM". That would be consistent with e.g. "abandoned: associated worker openqaworker-arm-2:17 re-connected but abandoned the job".


Died is a little bit harder. In these cases most likely a problem within isotovideo (e.g. the backend) caused the job to incomplete. Results (up to that point) and logs are usually available. These would provide further information but are usually completely cluttered. So it would be indeed nice to to have the most relevant line from the log added as reason, e.g. died: Error connecting to VNC server <10.161.145.13:5901>: IO::Socket::INET: connect: Connection timed out. Not sure how to achieve that in a generic way, though. Sometimes we might not be able to pin down the problem to a single line anyways or it would just get too long. I suppose it would help if isotovideo (or the subprocess in question) would store its "final fatal error message" at a dedicated place. Of course this fatal error message could as well just be caused by a symptom and the actual problem happened before.

#7 Updated by okurz 10 months ago

mkittler wrote:

The reason "quit" is admittedly not very self-explanatory. However, it has actually only one meaning (the worker received SIGTERM). Although there are no other "quit" reasons we could make it e.g. "quit: associated worker openqaworker-arm-2:17 received SIGTERM". That would be consistent with e.g. "abandoned: associated worker openqaworker-arm-2:17 re-connected but abandoned the job".

Hm, not sure if "associated worker … received SIGTERM" is really that helpful from user perspective. As I user reading this message I would either not understand it or ask "ok, so what?". Rough idea, how about: "quit: associated worker openqaworker-arm-2:17 shut down, job was cloned" ?

Died is a little bit harder. In these cases most likely a problem within isotovideo (e.g. the backend) caused the job to incomplete. Results (up to that point) and logs are usually available. […] Sometimes we might not be able to pin down the problem to a single line anyways or it would just get too long.

Yes, how about just: "died: See log output for details". For this I created a simple https://github.com/os-autoinst/openQA/pull/2854

For "quit" I did not easily understand where I could add more information to the string.

#8 Updated by mkittler 10 months ago

Rough idea, how about: "quit: associated worker openqaworker-arm-2:17 shut down, job was cloned" ?

I don't like the idea.

  1. "job was cloned" is redundant (there's the clone_id in the DB and in the web UI) and maybe not even correct (at the point the reason is recorded we might not whether we will successfully clone that job).
  2. "shut down" is more vague. It doesn't tell us the reason for the shut down, e.g. one might ask whether the worker crashed or why it decided to shut down. If "SIGTERM" is too technical what about "[...] has been stopped or restarted"? The passive makes at least clear that the worker merely reacted but did not initiate it.

Yes, how about just: "died: See log output for details".

I suppose people are used to look into the logs anyways but I like it for consistency.

For "quit" I did not easily understand where I could add more information to the string.

Here's a draft PR: https://github.com/os-autoinst/openQA/pull/2855

#9 Updated by cdywan 10 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

#10 Updated by okurz 10 months ago

https://github.com/os-autoinst/openQA/pull/2854 merged, https://github.com/os-autoinst/openQA/pull/2855 failed on scalability test failure. Either needs another fix or a rebase.

#11 Updated by okurz 10 months ago

  • Status changed from In Progress to Resolved

https://openqa.suse.de/tests/4034740 shows "died: terminated prematurely, see log output for details". That should be good enough.

#12 Updated by okurz 6 months ago

  • Status changed from Resolved to Workable
  • Target version changed from Current Sprint to Ready

actually not good enough and we still track jobs with this ticket, e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/236945 shows many. SQL query

select jobs.id,t_finished,test,reason,result from jobs,comments where result = 'incomplete' and t_finished >= '2020-07-20' and comments.job_id = jobs.id and text ~ 'poo#63718';

reveals jobs like https://openqa.opensuse.org/tests/1342075# that we already labeled with #69343 but the reason is just "died" here.

#13 Updated by mkittler 6 months ago

  • Status changed from Workable to In Progress

PR to improve some cases: https://github.com/os-autoinst/os-autoinst/pull/1493

I've been using the following SQL query to identify some cases:

select jobs.id,t_started,t_finished,test,reason,result,concat('https://openqa.suse.de/tests/', jobs.id) as url from jobs where result = 'incomplete' and reason = 'died' and t_finished >= '2020-07-20' order by t_finished desc;

With the PR the reason will still just be died if the command server, the backend or autotest crash badly or get killed.

#14 Updated by mkittler 6 months ago

  • Status changed from In Progress to Feedback

The PR has been merged. Let's see what cases are left which it is deployed for a while.

#15 Updated by mkittler 5 months ago

That's the list of jobs incompleted with just died we've still got on OSD since the 1st August:

select jobs.id,t_started,t_finished,test,reason,result,concat('https://openqa.suse.de/tests/', jobs.id) as url from jobs where result = 'incomplete' and reason = 'died' and t_finished >= '2020-08-01' order by t_finished desc;

Here the most interesting cases:

  1. # Test died: script timeout: zypper lifecycle at /usr/lib/os-autoinst/testapi.pm line 1104.: We should likely treat a timeout within script_output as a test failure. So the job should not be incomplete at all.
  2. Could not find 'validate_multipath' in complete history at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 85.: Errors from the Git helpers should likely also be serialized with bmwqemu::serialize_state.
  3. needles_dir not found: /var/lib/openqa/pool/24/os-autoinst-distri-opensuse/os-autoinst-distri-opensuse/products/sle/needles (check vars.json?) at /usr/lib/os-autoinst/needle.pm line 330.: same as 2.

Other failures look identical or there's no autoinst-log.txt anymore so it is hard to tell what the problem was.

Maybe we should also add a general die-handler to isotovideo (and not just the backend).

#16 Updated by mkittler 5 months ago

  • Status changed from Feedback to In Progress

#17 Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

The PR for 2. and 3. has been merged. I'm not sure about 1. yet (got no decisive feedback on #testing).

#18 Updated by okurz 5 months ago

  • Description updated (diff)
  • Status changed from Feedback to Workable

IMHO to finish this ticket we should just use the string "died: terminated prematurely, see log output for details" to give a little hint to users, nothing else :)

#20 Updated by cdywan 5 months ago

  • Status changed from Workable to In Progress

#21 Updated by mkittler 5 months ago

  • Status changed from In Progress to Resolved

I wanted to check whether it works on o3. There are actually already 25 jobs with the generic reason which seems to be quite a lot. To be sure the PR didn't break anything I've just did a few local test runs. The PR with the generic reason does not prevent the more specific reasons like "backend died: …" and "tests died: …" to work.

Because I was still curious about these 25 jobs I had a closer look. The jobs I've found on o3 are actually all older than the PR. Apparently we had a generic message at some point before the specific messages were implemented.

I guess I can nevertheless close the issue because we're monitoring incomplete jobs anyways and if there's still a job with just "died" we can simply reopen.

#22 Updated by okurz about 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: kde
https://openqa.opensuse.org/tests/1484527

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

Also available in: Atom PDF