Project

General

Profile

Actions

coordination #69451

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #102909: [epic] Prevent more incompletes already within os-autoinst or openQA

[epic] test incompletes with "(?s)Download.*successful.*Failed to download":retry, not helpful details

Added by okurz over 4 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Regressions/Crashes
Target version:
Start date:
2020-07-30
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/4504422 is incomplete with reason "setup failure: Failed to download SLEHPC-15-SP2-aarch64-GM-SERVER-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLEHPC-15-SP2-aarch64-GM-SERVER-gnome.qcow2" so no explanation of the real reason in there. The log content of autoinst-log.txt as shown on the details page shows:

[info] [#2427] Download of "/var/lib/openqa/cache/openqa.suse.de/SLEHPC-15-SP2-aarch64-GM-SERVER-gnome.qcow2" successful, new cache size is 50GiB
…
[2020-07-30T04:39:00.0302 CEST] [error] [pid:16217] Failed to download SLEHPC-15-SP2-aarch64-GM-SERVER-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLEHPC-15-SP2-aarch64-GM-SERVER-gnome.qcow2

which is certainly weird and not expected.

Acceptance criteria

  • AC1: No "download successful" should end with "Failed to download" referencing the same asset

Suggestions

  • Compare to #69448
  • Research what the actual error could be and add it as a "likely explanation" in the existing error string
  • Try to include the actual error details into the affected jobs logs
  • If possible, adapt the "reason" to be more specific with details about the actual error

Related issues 2 (1 open1 closed)

Copied from openQA Project - action #69448: test incompletes with auto_review:"(?s)was downloaded by.*details.*unavailable here.*Failed to download":retry , not helpful detailsWorkable2020-07-30

Actions
Copied to openQA Project - action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloadedResolvedmkittler2020-07-30

Actions
Actions #1

Updated by okurz over 4 years ago

  • Copied from action #69448: test incompletes with auto_review:"(?s)was downloaded by.*details.*unavailable here.*Failed to download":retry , not helpful details added
Actions #2

Updated by mkittler about 4 years ago

The linked job is gone so there's not much information to investigate the issue. I've tried find -name autoinst-log.txt -exec grep --with-filename -Pzo '(?s).*Download.*successful.*Failed to download.*' {} \; on the recent job history on OSD but I've only found incomplete jobs where the 2nd of two downloads failed. Not sure whether I can consider the ticket workable with so few information.


Note that the "successful" part comes from the cache service and the "failed" part from the worker which verifies again whether the asset is actually present. So apparently the cache service thinks the asset could be downloaded but the worker couldn't find it.

Try to include the actual error details into the affected jobs logs

From the point of we cache service everything went well so there are no error details. The worker could not find the asset. That's an error but there's no more information to add from the worker side.

I have two theories:

  1. The error handling of the cache service is insufficient.
  2. Something bad happened after the cache service concluded the download and before the worker checked for the presence of the asset.

If it is 2. I'm don't think there's much we can do.

Actions #3

Updated by okurz about 4 years ago

  • Priority changed from Normal to Low
  • Target version changed from Ready to future

Hm, we have a workaround in place which is even automatically applied, QA tools team is limited in capacity and nobody seems to have a clear idea what to do so we can also regard is Low+future

Actions #4

Updated by okurz about 4 years ago

  • Priority changed from Low to High
  • Target version changed from future to Ready

A query to find corresponding jobs:

select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-08-01' and jobs.id = comments.job_id and comments.text ~ '69451' order by t_finished desc limit 10;

this shows that this ticket is used a lot recently and the problem reproduces. This is making the issue more prominent and important. And yes, it can be that some downloads in a job succeed, others do not.

In https://openqa.suse.de/tests/4612518# there is

[info] [#1051] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-aarch64-21.1@aarch64-filesystem_withouthome-uefi-vars.qcow2" failed: 404 Not Found

so maybe the behaviour changed a bit and now it says "404 Not Found" but still this is not enough to be clear to readers because in before it still said "Download successful".

Actions #5

Updated by mkittler about 4 years ago

  • Assignee set to mkittler
  • Priority changed from High to Normal

Your query doesn't make much sense, you need to join the tables to get actually different job IDs.

The query

select concat('https://openqa.suse.de/t', jobs.id) as job, jobs.t_finished, text as comment, (select concat(host, ':', instance) as worker from workers where workers.id = jobs.assigned_worker_id) from jobs join comments on jobs.id = comments.job_id where t_finished >= '2020-08-01' and jobs.id = comments.job_id and comments.text like '%poo#69451%' order by t_finished desc;

returns currently 36 jobs on OSD and 9 jobs on o3.

I had a look at quite a share of these jobs. Most jobs I've checked have either no more logs or they are actually just containing multiple downloads and one was successful and others not. Considering so many jobs being wrongly associated with that ticket I wouldn't deduce the frequency this problem occurs from this SQL query.

I only found one job which where the log actually matches the log from the ticket's observation:

[2020-09-17T18:00:38.0616 CEST] [info] [pid:9073] +++ setup notes +++
[2020-09-17T18:00:38.0616 CEST] [info] [pid:9073] Running on openqaworker9:23 (Linux 4.12.14-lp151.28.67-default #1 SMP Fri Sep 4 15:23:21 UTC 2020 (2c5a14f) x86_64)
[2020-09-17T18:00:38.0625 CEST] [debug] [pid:9073] Found ISO, caching SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso
[2020-09-17T18:00:38.0628 CEST] [info] [pid:9073] Downloading SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso, request #7271 sent to Cache Service
[2020-09-17T18:08:59.0731 CEST] [info] [pid:9073] Download of SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso processed:
[info] [#7271] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#7271] Downloading "SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso" from "http://openqa.suse.de/tests/4704446/asset/iso/SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso"
[info] [#7271] Cache size 50GiB + needed 4GiB exceeds limit of 50GiB, purging least used assets
[info] [#7271] Purging "/var/lib/openqa/cache/openqa.suse.de/sle-15-x86_64-ha-rolling_upgrade_update.qcow2" because we need space for new assets, reclaiming 1.1GiB
[info] [#7271] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-x86_64-GM-DVD1.iso" because we need space for new assets, reclaiming 663MiB
[info] [#7271] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-Online-x86_64-Build36.5-Media1.iso.sha256" because we need space for new assets, reclaiming 112 Byte
[info] [#7271] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP3-x86_64-Build36.5-sles4sap-gnome.qcow2" because we need space for new assets, reclaiming 1.6GiB
[info] [#7271] Purging "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso" because we need space for new assets, reclaiming 4GiB
[info] [#7271] Size of "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso" is 4GiB, with ETag ""101000000-5a69b6dce2b91""
[info] [#7271] Download of "/var/lib/openqa/cache/openqa.suse.de/SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso" successful, new cache size is 47GiB

[2020-09-17T18:08:59.0732 CEST] [error] [pid:9073] Failed to download SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso
[2020-09-17T18:08:59.0851 CEST] [info] [pid:9073] +++ worker notes +++
[2020-09-17T18:08:59.0852 CEST] [info] [pid:9073] End time: 2020-09-17 16:08:59
[2020-09-17T18:08:59.0852 CEST] [info] [pid:9073] Result: setup failure
[2020-09-17T18:08:59.0876 CEST] [info] [pid:54995] Uploading autoinst-log.txt

(log of https://openqa.suse.de/tests/4704446)

Here the asset has been purged while the download was still ongoing. Maybe that was also the case for https://openqa.suse.de/tests/4504422 (the job mentioned in the ticket description). Unfortunately the job's log has already been cleaned up so it is impossible to check that at this point.

So maybe there's a bug which makes it possible for assets to be purged while still being downloaded. I'll have a look at the asset cache code.

Actions #6

Updated by mkittler about 4 years ago

We indeed run the purging before submitting a newly downloaded asset in order to make room for it. It looks like the code actually doesn't ensure the new asset isn't purged itself.

Actions #7

Updated by okurz about 4 years ago

hm, right. Well, overbooking the cache size can be dangerous but downloading an asset that gets deleted immediately is not efficient. For a start I hope we can we do something very simple: Improve the messaging and incomplete the job with a reason that triggers an automatic retrigger.

EDIT: The first thing I am commonly proposing is to just change strings as a pointer to the reader of logfiles. Not much of a solution but can save some time when people are looking for workarounds 🙂 If we would just have a unique message then we could try to match it with auto-review

Actions #8

Updated by okurz about 4 years ago

  • Subject changed from test incompletes with auto_review:"(?s)Download.*successful.*Failed to download":retry , not helpful details to test incompletes with auto_review:"(?s)Download.*successful.*Failed to download":retry, not helpful details
  • Description updated (diff)

Clarifing the AC about "referencing the same asset". We have to be careful that the auto-review regex is rather generic so this ticket will catch the cases where the downloaded asset was purged as well as the case when other assets failed to download.

Actions #9

Updated by okurz about 4 years ago

  • Copied to action #71827: test incompletes with auto_review:"(?s)Failed to download.*Asset was pruned immediately after download":retry because worker cache prunes the asset it just downloaded added
Actions #10

Updated by okurz about 4 years ago

  • Subject changed from test incompletes with auto_review:"(?s)Download.*successful.*Failed to download":retry, not helpful details to [epic] test incompletes with auto_review:"(?s)Download.*successful.*Failed to download":retry, not helpful details
  • Status changed from Workable to Blocked
  • Assignee changed from mkittler to okurz
Actions #11

Updated by okurz about 4 years ago

Added a specific message in https://github.com/os-autoinst/openQA/pull/3418

As soon as the change is deployed to both o3 and osd we can remove the auto_review string here and put it into the related ticket #71827 instead

Actions #12

Updated by szarate about 4 years ago

  • Tracker changed from action to coordination
  • Status changed from Blocked to New
Actions #14

Updated by okurz about 4 years ago

mkittler wrote:

Your query doesn't make much sense, you need to join the tables to get actually different job IDs.

The query

select concat('https://openqa.suse.de/t', jobs.id) as job, jobs.t_finished, text as comment, (select concat(host, ':', instance) as worker from workers where workers.id = jobs.assigned_worker_id) from jobs join comments on jobs.id = comments.job_id where t_finished >= '2020-08-01' and jobs.id = comments.job_id and comments.text like '%poo#69451%' order by t_finished desc;

I consider your choice of words "doesn't make much sense" a bit harsh. I think this can be done a bit more simple by adding to my query the part and jobs.assigned_worker_id = workers.id so you have something like:

select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-08-01' and jobs.id = comments.job_id and jobs.assigned_worker_id = workers.id and comments.text ~ '#69451' order by t_finished desc limit 10;
Actions #15

Updated by okurz about 4 years ago

  • Subject changed from [epic] test incompletes with auto_review:"(?s)Download.*successful.*Failed to download":retry, not helpful details to [epic] test incompletes with "(?s)Download.*successful.*Failed to download":retry, not helpful details
  • Status changed from New to Blocked

waiting for related ticket #71827 after https://github.com/os-autoinst/openQA/pull/3418 is in

Actions #16

Updated by okurz about 4 years ago

  • Parent task set to #62420
Actions #17

Updated by okurz over 3 years ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
Actions #18

Updated by okurz over 3 years ago

  • Target version changed from Ready to future
Actions #19

Updated by okurz almost 3 years ago

  • Parent task changed from #62420 to #102909
Actions

Also available in: Atom PDF