action #72055
closeddownload_asset tasks are not reliable
Description
For a new project I'm posting 6 jobs with the same HDD_1_URL. All 6 trigger a download_asset task, but in about every 5th run one of the tasks fails me - and the associated job incompletes with 404.
In rocketchat we discussed that we don't see the deduplication warning in the GRU log in these cases.
Updated by coolo almost 4 years ago
https://openqa.suse.de/tests/4744520 is the latest example. I could only find 3 download_asset tasks
{
"args" => [
"http://carwos-runner.qa.suse.de/gitlab/carwos/carwos-carwos_master-264511.qcow2",
"/var/lib/openqa/share/factory/hdd/carwos-carwos_master-264511.qcow2",
0
],
"attempts" => 1,
"children" => [],
"created" => "2020-09-28T21:51:55.75523Z",
"delayed" => "2020-09-28T21:51:55.75523Z",
"expires" => undef,
"finished" => "2020-09-28T21:51:57.99779Z",
"id" => 663222,
"lax" => 0,
"notes" => {
"gru_id" => 27777156
},
"parents" => [],
"priority" => 20,
"queue" => "default",
"result" => "Job successfully executed",
"retried" => undef,
"retries" => 0,
"started" => "2020-09-28T21:51:55.75792Z",
"state" => "finished",
"task" => "download_asset",
"time" => "2020-09-29T05:16:34.48117Z",
"worker" => 351
}
(not retried)
{
"args" => [
"http://carwos-runner.qa.suse.de/gitlab/carwos/carwos-carwos_master-264511.qcow2",
"/var/lib/openqa/share/factory/hdd/carwos-carwos_master-264511.qcow2",
0
],
"attempts" => 1,
"children" => [],
"created" => "2020-09-28T21:51:56.55585Z",
"delayed" => "2020-09-28T21:52:26.59534Z",
"expires" => undef,
"finished" => "2020-09-28T21:52:27.6253Z",
"id" => 663223,
"lax" => 0,
"notes" => {
"gru_id" => 27777157
},
"parents" => [],
"priority" => 20,
"queue" => "default",
"result" => "Job successfully executed",
"retried" => "2020-09-28T21:51:56.59534Z",
"retries" => 1,
"started" => "2020-09-28T21:52:27.58941Z",
"state" => "finished",
"task" => "download_asset",
"time" => "2020-09-29T05:16:34.48117Z",
"worker" => 351
}
(1 retries)
{
"args" => [
"http://carwos-runner.qa.suse.de/gitlab/carwos/carwos-carwos_master-264511.qcow2",
"/var/lib/openqa/share/factory/hdd/carwos-carwos_master-264511.qcow2",
0
],
"attempts" => 1,
"children" => [],
"created" => "2020-09-28T21:51:57.41415Z",
"delayed" => "2020-09-28T21:52:27.45291Z",
"expires" => undef,
"finished" => "2020-09-28T21:52:27.64024Z",
"id" => 663224,
"lax" => 0,
"notes" => {
"gru_id" => 27777158
},
"parents" => [],
"priority" => 20,
"queue" => "default",
"result" => "Job successfully executed",
"retried" => "2020-09-28T21:51:57.45291Z",
"retries" => 1,
"started" => "2020-09-28T21:52:27.60471Z",
"state" => "finished",
"task" => "download_asset",
"time" => "2020-09-29T05:16:34.48117Z",
"worker" => 351
}
(1 retried also)
The failed job was not the first to be scheduled - 2s later the asset was properly available
Sep 28 23:51:57 openqa openqa[24565]: [debug] [jNHJI9DV] GET "/tests/4744520/asset/hdd/carwos-carwos_master-264511.qcow2"
Sep 28 23:51:59 openqa openqa[24565]: [debug] [b1wvAEhb] GET "/tests/4744519/asset/hdd/carwos-carwos_master-264511.qcow2"
Sep 28 23:51:59 openqa openqa[24565]: [debug] redirect to /assets/hdd/carwos-carwos_master-264511.qcow2
Sep 28 23:51:59 openqa openqa[24565]: [debug] [k8eUrvcM] GET "/tests/4744522/asset/hdd/carwos-carwos_master-264511.qcow2"
Sep 28 23:51:59 openqa openqa[24565]: [debug] redirect to /assets/hdd/carwos-carwos_master-264511.qcow2
Sep 28 23:52:01 openqa openqa[24565]: [debug] [27d5l-9q] GET "/tests/4744524/asset/hdd/carwos-carwos_master-264511.qcow2"
Sep 28 23:52:01 openqa openqa[24565]: [debug] redirect to /assets/hdd/carwos-carwos_master-264511.qcow2
So my theory is that there was no GRU job associated with the failed job
Updated by okurz almost 4 years ago
- Assignee set to mkittler
- Target version set to Ready
Just adding what I had wrote in the upfront chat as well: I don't think this "Skipping" is used for downloads of assets to the webui but only for the cache service on the worker. It was introduced with https://github.com/os-autoinst/openQA/pull/2736 and AFAIK sriedel never implemented something for the asset download to webui, only worker asset cache.
Likely this never worked so far and can be considered a feature request.
mkittler is already working in a related area with https://github.com/os-autoinst/openQA/pull/3429 so I will add him as assignee for now
Updated by mkittler almost 4 years ago
@okurz We actually skip asset downloads within the web UI as well and it also generally works in the case of @coolo's jobs, e.g.:
[2020-09-28T13:15:11.0201 CEST] [debug] Process 21999 is performing job "662519" with task "download_asset"
[2020-09-28T13:15:11.0232 CEST] [debug] [#662519] Downloading "http://carwos-runner.qa.suse.de/gitlab/coolo/carwos-coolo_branch_1-264330.qcow2" to "/var/lib/openqa/share/factory/hdd/carwos-coolo_branch_1-264330.qcow2"
[2020-09-28T13:15:11.0233 CEST] [info] [#662519] Downloading "carwos-coolo_branch_1-264330.qcow2" from "http://carwos-runner.qa.suse.de/gitlab/coolo/carwos-coolo_branch_1-264330.qcow2"
[2020-09-28T13:15:12.0013 CEST] [debug] Process 22001 is performing job "662520" with task "download_asset"
[2020-09-28T13:15:13.0073 CEST] [debug] Process 22003 is performing job "662521" with task "download_asset"
[2020-09-28T13:15:14.0057 CEST] [debug] Process 22006 is performing job "662522" with task "download_asset"
[2020-09-28T13:15:14.0814 CEST] [debug] [#662519] Download of "/var/lib/openqa/share/factory/hdd/carwos-coolo_branch_1-264330.qcow2" successful
[2020-09-28T13:15:44.0191 CEST] [debug] Process 22131 is performing job "662520" with task "download_asset"
[2020-09-28T13:15:44.0207 CEST] [debug] Process 22133 is performing job "662521" with task "download_asset"
[2020-09-28T13:15:44.0220 CEST] [debug] Process 22137 is performing job "662522" with task "download_asset"
[2020-09-28T13:15:44.0230 CEST] [info] [#662520] Skipping download of "http://carwos-runner.qa.suse.de/gitlab/coolo/carwos-coolo_branch_1-264330.qcow2" to "/var/lib/openqa/share/factory/hdd/carwos-coolo_branch_1-264330.qcow2" because file already exists
[2020-09-28T13:15:44.0236 CEST] [info] [#662521] Skipping download of "http://carwos-runner.qa.suse.de/gitlab/coolo/carwos-coolo_branch_1-264330.qcow2" to "/var/lib/openqa/share/factory/hdd/carwos-coolo_branch_1-264330.qcow2" because file already exists
[2020-09-28T13:15:44.0245 CEST] [info] [#662522] Skipping download of "http://carwos-runner.qa.suse.de/gitlab/coolo/carwos-coolo_branch_1-264330.qcow2" to "/var/lib/openqa/share/factory/hdd/carwos-coolo_branch_1-264330.qcow2" because file already exists
This is actually also what my PR which you have also been mentioning is about. It fixes the behavior when one download has multiple destinations. However, this is not the case in the jobs this ticket is about.
In RC we were takling about this job: https://openqa.suse.de/tests/4743806
I've been checking out the related Minion jobs using this SQL query but couldn't find anything useful:
openqa=# select concat('https://openqa.suse.de/minion/jobs?id=', id) as url, state, result from minion_jobs where task = 'download_asset' and args::text like '%carwos-coolo_branch_1-264225.qcow2%';
url | state | result
----------------------------------------------+----------+-----------------------------
https://openqa.suse.de/minion/jobs?id=662340 | finished | "Job successfully executed"
https://openqa.suse.de/minion/jobs?id=662339 | finished | "Job successfully executed"
https://openqa.suse.de/minion/jobs?id=662341 | finished | "Job successfully executed"
(3 Zeilen)
Despite these tasks being successful, the worker cache fails to download the assets. In case the log is being cleaned up on OSD, that's how it looked like:
[2020-09-28T11:20:42.0581 CEST] [info] [pid:60606] +++ setup notes +++
[2020-09-28T11:20:42.0582 CEST] [info] [pid:60606] Running on openqaworker8:21 (Linux 4.12.14-lp151.28.67-default #1 SMP Fri Sep 4 15:23:21 UTC 2020 (2c5a14f) x86_64)
[2020-09-28T11:20:42.0587 CEST] [debug] [pid:60606] Found HDD_1, caching carwos-coolo_branch_1-264225.qcow2
[2020-09-28T11:20:42.0590 CEST] [info] [pid:60606] Downloading carwos-coolo_branch_1-264225.qcow2, request #20974 sent to Cache Service
[2020-09-28T11:20:47.0645 CEST] [info] [pid:60606] Download of carwos-coolo_branch_1-264225.qcow2 processed:
[info] [#20974] Cache size of "/var/lib/openqa/cache" is 50GiB, with limit 50GiB
[info] [#20974] Downloading "carwos-coolo_branch_1-264225.qcow2" from "http://openqa.suse.de/tests/4743806/asset/hdd/carwos-coolo_branch_1-264225.qcow2"
[info] [#20974] Download of "/var/lib/openqa/cache/openqa.suse.de/carwos-coolo_branch_1-264225.qcow2" failed: 404 Not Found
[2020-09-28T11:20:47.0646 CEST] [error] [pid:60606] Failed to download carwos-coolo_branch_1-264225.qcow2 to /var/lib/openqa/cache/openqa.suse.de/carwos-coolo_branch_1-264225.qcow2
[2020-09-28T11:20:47.0669 CEST] [info] [pid:60606] +++ worker notes +++
[2020-09-28T11:20:47.0669 CEST] [info] [pid:60606] End time: 2020-09-28 09:20:47
[2020-09-28T11:20:47.0669 CEST] [info] [pid:60606] Result: setup failure
[2020-09-28T11:20:47.0708 CEST] [info] [pid:35015] Uploading autoinst-log.txt
There's another instance: https://openqa.suse.de/tests/4744520
The log and relevant Minion jobs look identical. Only the asset name is different (carwos-carwos_master-264511.qcow2
).
So my theory is that there was no GRU job associated with the failed job
That's also what I suspect based on the information we've gathered so far. So I'm checking the relevant code for potential bugs.
Updated by mkittler almost 4 years ago
Since all your post requests succeeded I assume no error occurred when creating the GRU job. I've also read the code for the task creation. It looks good as well as its unit test. There's only one possibility I can think of: The scheduler assigns the job to a worker in the small time frame after the job has been created but the GRU dependencies have not been added yet to the database. Maybe it helps to do all of this within a transaction using the right isolation level.
Updated by mkittler almost 4 years ago
- Status changed from New to In Progress
This PR fixes the problem mentioned in the previous comment: https://github.com/os-autoinst/openQA/pull/3436
Of course it is not clear yet whether this problem is the cause of issue you've encountered. So let me know whether it helped (when the PR has been merged and deployed).
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Feedback
PR has been merged
Updated by mkittler almost 4 years ago
The change is on OSD since the deployment yesterday. So let me know if it worked.
Updated by mkittler almost 4 years ago
- Status changed from Feedback to Resolved
select concat('https://openqa.suse.de/tests/', id) as url from jobs where reason like '%Failed to download carwos%';
shows only 3 jobs and they failed before the PR has been merged. There was also no further response from you so I'm marking the issue as resolved.