action #62159

Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI)

Added by favogt 7 months ago. Updated about 3 hours ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2020-01-15
Due date:
% Done: 0%
Estimated time:
Difficulty:
Duration:

Description

Observation

https://openqa.opensuse.org/tests/1144395 has

HDD_1_URL=http://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=20200113

but the test fails with

[info] [#192844] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-JeOS.x86_64-old-20200113.qcow2" because the download failed: 404 - Not Found

GRU logs show that no download was attempted at all.

Cloning the job using the Web UI results in the same error 100% reproducible.

However, using just

openqa-clone-job 1144395

results in a working job, https://openqa.opensuse.org/tests/1144396

This is currently the (only) blocker for JeOS zdup tests for TW.

See also https://progress.opensuse.org/issues/57617, which this is a part of...

Steps to reproduce

See #62159#note-14


Related issues

Related to openQA Project - action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download (Resolved, 2019-10-08)

Related to openQA Project - action #62459: Retry on download errors within GRU download tasks (Resolved, 2020-01-21)

Has duplicate openQA Tests - action #65025: [opensuse][aarch64][jeos] consistently incompleting scenario that never worked "opensuse-Tumbleweed-JeOS-for-AArch64-aarch64-jeos_tw_zdup_aarch64@aarch64" (New, 2020-03-31)

History

#1 Updated by mkittler 7 months ago

  • Related to action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download added

#2 Updated by mkittler 7 months ago

The code which parses the settings variables is the same for the ISO post as for the single job post (parse_assets_from_settings method). The code for enqueuing the download jobs is also the same in both cases (enqueue_download_jobs method). Hence the bug must be in the way enqueue_download_jobs is called when scheduling the ISO. But both use create_downloads_list for this, so I'm not sure what makes the difference here.

#3 Updated by mkittler 7 months ago

Cloning the job using the Web UI results in the same error 100% reproducible.

I assume you mean clicking the "restart" button on the web UI. Restarting the job via the web UI does not help because openQA's job duplication code simply does not create a new download task. That is a known limitation, see https://progress.opensuse.org/issues/57782#note-2 and https://progress.opensuse.org/issues/57782#note-5. I've also created a draft to prevent jobs with missing assets from being restarted in the first place, which goes in the opposite direction. Maybe we should clarify whether we want to retrigger downloads or not. I've been messing with the related code recently, so I would agree with coolo's statement from the other issue:

I wouldn't restart the download from retriggering jobs. Everyone who knows the retriggering jobs code will agree :)

#4 Updated by okurz 7 months ago

  • Category set to Feature requests

mkittler wrote:

[…] Maybe we should clarify whether we want to retrigger downloads or not.

I guess it's a reasonable expectation that a "retry" (aka. retrigger) would retry what openQA was asked to do initially, that includes the download of necessary assets.

#5 Updated by favogt 7 months ago

How is this a feature request? It's quite clearly a bug.

#6 Updated by okurz 7 months ago

I am following what we defined on https://progress.opensuse.org/projects/openqav3/wiki#ticket-categories . Categorizing it does not have a direct impact on severity or our priority of the issue. To my understanding this never worked. Maybe I am misreading your observation and this really is a regression? In that case, could you help us find a "last good"?

#7 Updated by mkittler 7 months ago

  • Category changed from Feature requests to Concrete Bugs

It is a bug that GRU didn't download the asset in the first place (regardless of the restart feature which I have only mentioned because favogt tried to use it as workaround). The problem is also reproducible, e.g. further jobs on o3 of the scenario show the problem again.

#8 Updated by okurz 7 months ago

mkittler, so do you know when this regression was introduced then?

#9 Updated by okurz 7 months ago

  • Related to action #62459: Retry on download errors within GRU download tasks added

#10 Updated by favogt 5 months ago

  • Has duplicate action #65025: [opensuse][aarch64][jeos] consistently incompleting scenario that never worked "opensuse-Tumbleweed-JeOS-for-AArch64-aarch64-jeos_tw_zdup_aarch64@aarch64" added

#11 Updated by favogt 2 months ago

Any news here? This is also needed by MicroOS tests now.

rbrown added a download cron job to work around this, but that's not great for multiple reasons.

#12 Updated by mkittler 2 months ago

After reading the ticket description again I'm not sure anymore what this ticket is about. Is it about

  1. restarting a job within the web UI or API? That's now actually prevented if there are missing assets (with a force option) as requested by #34783.
  2. posting "an ISO" via the API? (Likely that's not the case because the job mentioned in the description has not been created by posting an ISO.)
  3. posting a single job via the API?

If it is option 1, that still leaves the question of how the job was created initially and why the asset download failed in the first place.

If there are more recent examples, can you provide some links?

#13 Updated by favogt 2 months ago

The job was created by posting an ISO, through the obs_rsync scripts.

Both 1 and 3 should be fine AFAIK, though I haven't tried that again.

As okurz removed the test for some reason, I don't have any recent example. I'll try to create a minimal PoC locally.

#14 Updated by favogt 2 months ago

favogt wrote:

As okurz removed the test for some reason, I don't have any recent example. I'll try to create a minimal PoC locally.

Done and copied to o3: https://openqa.opensuse.org/tests/1293416#details

To reproduce: openqa-client --host http://openqa.opensuse.org isos post DISTRI=opensuse VERSION=1 FLAVOR=poo62159 ARCH=x86_64 BUILD=1
The created job incompletes, but when cloning it with openqa-clone-job it creates a GRU task and works.

#15 Updated by okurz 30 days ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Target version set to Ready

#16 Updated by okurz 14 days ago

  • Subject changed from Asset download not done if job scheduled using the Web UI to Asset GRU download not done by web UI host if job scheduled by `isos post` (was: … using the Web UI)

#17 Updated by okurz 14 days ago

  • Subject changed from Asset GRU download not done by web UI host if job scheduled by `isos post` (was: … using the Web UI) to Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI)

#18 Updated by Xiaojing_liu 3 days ago

The difference in creating the download list between 'isos post' and 'job create' is this: 'isos post' uses the arguments of the call to create the download list, whereas 'job create' (e.g. via openqa-clone-job) uses the job's settings. This may explain why 'openqa-clone-job' works but 'isos post' does not.

I am not sure whether we should call create_downloads_list for every job when doing 'isos post', because many jobs may be created.

favogt
workaround:
openqa-client --host http://openqa.opensuse.org isos post DISTRI=opensuse VERSION=1 FLAVOR=poo62159 ARCH=x86_64 BUILD=1 HDD_1=openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 HDD_1_URL=http://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=1
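
The difference described in this note can be illustrated with a minimal Python sketch. It is not openQA's actual (Perl) code: `build_download_list`, the rule for picking the `hdd`/`iso` subdirectory and the way the file name is derived are assumptions made for illustration. Only the `_URL` suffix convention, the `HDD_1`/`HDD_1_URL` settings and the destination path come from this ticket.

```python
import posixpath
from urllib.parse import urlparse

# Assumed asset root, taken from the paths shown elsewhere in this ticket.
FACTORY_DIR = "/var/lib/openqa/share/factory"

def build_download_list(settings):
    """Hypothetical helper: collect one download per *_URL setting."""
    downloads = {}
    for key, url in settings.items():
        if not key.endswith("_URL"):
            continue
        short = key[: -len("_URL")]  # e.g. HDD_1_URL -> HDD_1
        # Use the explicit file name if given, else the last URL path segment.
        filename = settings.get(short) or posixpath.basename(urlparse(url).path)
        # Assumption: HDD_* assets go to hdd/, everything else to iso/.
        subdir = "hdd" if short.startswith("HDD") else "iso"
        downloads[url] = posixpath.join(FACTORY_DIR, subdir, filename)
    return downloads

url = ("http://download.opensuse.org/tumbleweed/appliances/"
       "openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=1")

# Arguments of the `isos post` call from note 14: no *_URL key at all.
isos_post_args = {"DISTRI": "opensuse", "VERSION": "1", "FLAVOR": "poo62159",
                  "ARCH": "x86_64", "BUILD": "1"}

# Settings of the resulting job, where HDD_1_URL was pulled in from elsewhere
# (e.g. the test suite); this is what a cloned job would be created from.
job_settings = dict(isos_post_args, HDD_1_URL=url)

from_args = build_download_list(isos_post_args)  # nothing to download
from_job = build_download_list(job_settings)     # one download entry
```

Under these assumptions, a list built from the call arguments stays empty unless `HDD_1_URL` is passed explicitly, which is what the workaround does.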

#19 Updated by okurz 2 days ago

Xiaojing_liu wrote:

I am not sure whether we should call create_downloads_list for every job when doing 'isos post', because many jobs may be created.

I think this should be the same as calling jobs post … HDD_1_URL=http://download.opensuse.org/my/same/asset.qcow2 10 times. I suggest to just try this out and see what happens.

I recommend to just try out what happens when we call create_downloads_list.

favogt
workaround:
openqa-client --host http://openqa.opensuse.org isos post DISTRI=opensuse VERSION=1 FLAVOR=poo62159 ARCH=x86_64 BUILD=1 HDD_1=openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 HDD_1_URL= http://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=1

I think there are spaces between HDD_1_URL= and its arguments which should not be there.

#20 Updated by mkittler 2 days ago

  • Category changed from Concrete Bugs to Feature requests

@fvogt We came to the conclusion that this issue is actually: Add support for triggering GRU asset downloads via "isos post" when the relevant _URL parameters are not directly provided but only pulled from e.g. the test suites table

Is that right? I'm just wondering because in the ticket description this feature request is mixed up with restarting jobs and errors reported from the worker's asset cache.

#21 Updated by Xiaojing_liu 2 days ago

  • Category changed from Feature requests to Concrete Bugs

okurz wrote:

Xiaojing_liu wrote:

I am not sure whether we should call create_downloads_list for every job when doing 'isos post', because many jobs may be created.

I think this should be the same as calling jobs post … HDD_1_URL=http://download.opensuse.org/my/same/asset.qcow2 10 times. I suggest to just try this out and see what happens.

I recommend to just try out what happens when we call create_downloads_list.

Here is the test result:
calling job post 10 times:

        # for i in $(seq 10); do openqa-cli api -X post jobs --host http://10.67.19.103 TEST=kde DISTRI=sle MACHINE=64bit; done

The job IDs are 186 ... 195.
We can see that 10 records were created in the database table gru_tasks:

id | taskname | args | run_at | priority | t_created | t_updated
5284 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:52 | 20 | 2020-08-11 10:31:52 | 2020-08-11 10:31:52
5283 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:52 | 20 | 2020-08-11 10:31:52 | 2020-08-11 10:31:52
5282 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:51 | 20 | 2020-08-11 10:31:51 | 2020-08-11 10:31:51
5281 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:51 | 20 | 2020-08-11 10:31:51 | 2020-08-11 10:31:51
5280 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:51 | 20 | 2020-08-11 10:31:51 | 2020-08-11 10:31:51
5279 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5278 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5277 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5276 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5275 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:49 | 20 | 2020-08-11 10:31:49 | 2020-08-11 10:31:49

(10 rows)

And the gru_dependencies result is

job_id | gru_task_id
195 | 5284
194 | 5283
193 | 5282
192 | 5281
191 | 5280
190 | 5279
189 | 5278
188 | 5277
187 | 5276
186 | 5275

(10 rows)

#22 Updated by Xiaojing_liu 2 days ago

  • Category changed from Concrete Bugs to Feature requests

#23 Updated by mkittler about 23 hours ago

Ok, so that would create multiple asset downloads. Judging by the code in openQA/lib/OpenQA/Task/Asset/Download.pm nothing bad will happen in that case. There's a lock to prevent concurrently downloading the same asset and a check to prevent downloading an existing asset again.
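
Why repeated download tasks for the same asset are harmless can be sketched as follows. This is a simplified Python illustration of the "lock plus existence check" idea only, not the code in Download.pm: `download_asset`, `fake_fetch` and the in-process `threading.Lock` are assumptions, and the placeholder URL is made up. Here a waiting task simply blocks on the lock, which may differ from how the real task reacts to a held lock.

```python
import os
import tempfile
import threading

_locks = {}                      # one lock per destination (hypothetical)
_locks_guard = threading.Lock()  # protects the _locks dictionary itself
fetch_count = 0                  # counts real downloads, for demonstration

def fake_fetch(url, destination):
    """Stand-in for the real HTTP download; writes a placeholder file."""
    global fetch_count
    fetch_count += 1
    with open(destination, "w") as fh:
        fh.write("asset from " + url)

def download_asset(url, destination):
    """Download once per destination, even if requested concurrently."""
    with _locks_guard:
        lock = _locks.setdefault(destination, threading.Lock())
    with lock:                           # serialize same-destination tasks
        if os.path.exists(destination):  # asset already present: skip
            return "skipped"
        fake_fetch(url, destination)
        return "downloaded"

# Ten tasks for the same asset, mirroring the ten gru_tasks rows in note 21.
dest = os.path.join(tempfile.mkdtemp(), "asset.qcow2")
threads = [threading.Thread(target=download_asset,
                            args=("http://example.invalid/asset.qcow2", dest))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch only the first task to take the lock fetches the file; the other nine find it already present and return without downloading.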

I suppose it would nevertheless be a good idea to de-duplicate the download lists for the jobs created by isos post by download destination, to reduce overhead. (E.g. enqueue_download_jobs would accept multiple download lists and skip "visited" download destinations.)
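
A minimal Python sketch of that de-duplication idea, assuming each job contributes a mapping from URL to destination: the function below borrows the name enqueue_download_jobs for readability, but its signature, the returned task dictionaries and the `job_ids` field are hypothetical and do not reflect the existing Perl method.

```python
def enqueue_download_jobs(download_lists):
    """Hypothetical variant: accept several download lists (one per job)
    and return one task per destination, remembering which jobs need it."""
    tasks = {}  # destination -> task
    for job_id, downloads in download_lists.items():
        for url, destination in downloads.items():
            # A destination seen before reuses the existing task.
            task = tasks.setdefault(
                destination,
                {"url": url, "destination": destination, "job_ids": []},
            )
            task["job_ids"].append(job_id)
    return list(tasks.values())

url = ("http://download.opensuse.org/tumbleweed/appliances/"
       "openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2")
dest = ("/var/lib/openqa/share/factory/hdd/"
        "openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2")

# Ten jobs (IDs 186..195, as in note 21) that all need the same asset.
lists = {job_id: {url: dest} for job_id in range(186, 196)}
tasks = enqueue_download_jobs(lists)
```

With this approach the ten jobs would share a single download task instead of creating ten, while each job can still be linked to the task it depends on.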

#24 Updated by Xiaojing_liu about 3 hours ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu
