action #62159

Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI)

Added by favogt about 1 year ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-01-15
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

https://openqa.opensuse.org/tests/1144395 has

HDD_1_URL=http://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=20200113

but the test fails with

[info] [#192844] Purging "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-JeOS.x86_64-old-20200113.qcow2" because the download failed: 404 - Not Found

GRU logs show that no download was attempted at all.

Cloning the job using the Web UI results in the same error 100% reproducible.

However, using just

openqa-clone-job 1144395

results in a working job, https://openqa.opensuse.org/tests/1144396

This is currently the (only) blocker for JeOS zdup tests for TW.

See also https://progress.opensuse.org/issues/57617, which this is a part of...

Steps to reproduce

See #62159#note-14


Related issues

Related to openQA Project - action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download (Resolved, 2019-10-08)

Related to openQA Project - action #62459: Retry on download errors within GRU download tasks (Resolved, 2020-01-21)

Related to openQA Project - action #70687: Download gru is attached to all scheduled jobs when doing 'isos post' (Resolved, 2020-08-31)

Related to openQA Project - action #72142: Avoid problematic symlinking in download assets tasks of the web UI (New, 2020-09-30)

Has duplicate openQA Tests - action #65025: [opensuse][aarch64][jeos] consistently incompleting scenario that never worked "opensuse-Tumbleweed-JeOS-for-AArch64-aarch64-jeos_tw_zdup_aarch64@aarch64" (Resolved, 2020-03-31)

History

#1 Updated by mkittler almost 1 year ago

  • Related to action #57782: retrigger of job with failed gru download task ends up incomplete with 404 on asset, does not retry download added

#2 Updated by mkittler 12 months ago

The code which parses the settings variables is the same for the ISO post as for the single jobs post (parse_assets_from_settings method). The code for enqueuing the download jobs is also the same in both cases (enqueue_download_jobs method). Hence the bug must be in the way enqueue_download_jobs is called when scheduling the ISO. But both use create_downloads_list for this, so I'm not sure what makes the difference here.

#3 Updated by mkittler 12 months ago

Cloning the job using the Web UI results in the same error 100% reproducible.

I assume you mean clicking the "restart" button on the web UI. Restarting the job via the web UI does not help because openQA's job duplication code simply does not create a new download task. That is a known limitation, see https://progress.opensuse.org/issues/57782#note-2 and https://progress.opensuse.org/issues/57782#note-5. I've also created a draft to prevent jobs with missing assets from being restarted in the first place, which goes in the opposite direction. Maybe we should clarify whether we want to retrigger downloads or not. I've been messing with the related code recently, so I would agree with coolo's statement from the other issue:

I wouldn't restart the download from retriggering jobs. Everyone who knows the retriggering jobs code will agree :)

#4 Updated by okurz 12 months ago

  • Category set to Feature requests

mkittler wrote:

[…] Maybe we should clarify whether we want to retrigger downloads or not.

I guess it's a reasonable expectation that a "retry" (aka. retrigger) would retry what openQA was asked to do initially, that includes the download of necessary assets.

#5 Updated by favogt 12 months ago

How is this a feature request? It's quite clearly a bug.

#6 Updated by okurz 12 months ago

I am following what we defined on https://progress.opensuse.org/projects/openqav3/wiki#ticket-categories . Categorizing it does not have a direct impact on severity or our priority of the issue. To my understanding this never worked. Maybe I am misreading your observation and this is really a regression? In this case could you help us to find any "last good"?

#7 Updated by mkittler 12 months ago

  • Category changed from Feature requests to Concrete Bugs

It is a bug that GRU didn't download the asset in the first place (regardless of the restart feature which I have only mentioned because favogt tried to use it as workaround). The problem is also reproducible, e.g. further jobs on o3 of the scenario show the problem again.

#8 Updated by okurz 12 months ago

mkittler so do you know since when this regression was introduced then?

#9 Updated by okurz 12 months ago

  • Related to action #62459: Retry on download errors within GRU download tasks added

#10 Updated by favogt 10 months ago

  • Has duplicate action #65025: [opensuse][aarch64][jeos] consistently incompleting scenario that never worked "opensuse-Tumbleweed-JeOS-for-AArch64-aarch64-jeos_tw_zdup_aarch64@aarch64" added

#11 Updated by favogt 8 months ago

Any news here? This is also needed by MicroOS tests now.

rbrown added a download cron job to work around this, but that's not great for multiple reasons.

#12 Updated by mkittler 8 months ago

After reading the ticket description again I'm not sure anymore what this ticket is about. Is it about

  1. restarting a job within the web UI or API? That's now actually prevented if there are missing assets (with a force option) as requested by #34783.
  2. posting "an ISO" via the API? (Likely that's not the case because the job mentioned in the description has not been created by posting an ISO.)
  3. posting a single job via the API?

If it is option 1, that still leaves the question of how the job was created initially and why the asset download failed in the first place.

If there are more recent examples, can you provide some links?

#13 Updated by favogt 8 months ago

The job was created by posting an ISO, through the obs_rsync scripts.

Both 1 and 3 should be fine AFAIK, though I haven't tried that again.

As okurz removed the test for some reason, I don't have any recent example. I'll try to create a minimal PoC locally.

#14 Updated by favogt 8 months ago

favogt wrote:

As okurz removed the test for some reason, I don't have any recent example. I'll try to create a minimal PoC locally.

Done and copied to o3: https://openqa.opensuse.org/tests/1293416#details

To reproduce: openqa-client --host http://openqa.opensuse.org isos post DISTRI=opensuse VERSION=1 FLAVOR=poo62159 ARCH=x86_64 BUILD=1
The created job incompletes, but when cloning it with openqa-clone-job it creates a GRU task and works.

#15 Updated by okurz 6 months ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Target version set to Ready

#16 Updated by okurz 6 months ago

  • Subject changed from Asset download not done if job scheduled using the Web UI to Asset GRU download not done by web UI host if job scheduled by `isos post` (was: … using the Web UI)

#17 Updated by okurz 6 months ago

  • Subject changed from Asset GRU download not done by web UI host if job scheduled by `isos post` (was: … using the Web UI) to Asset GRU download not done by web UI host if job scheduled by `isos post`, fails to download and then cloned (was: … using the Web UI)

#18 Updated by Xiaojing_liu 5 months ago

The difference in creating the download list between 'isos post' and 'job create' is this: 'isos post' uses the request arguments to create the download list, while 'job create' (such as openqa-clone-job) uses the job's settings. Maybe this explains why 'openqa-clone-job' works but 'isos post' does not.

I am not sure if we should call create_downloads_list for every job when doing 'isos post', because many jobs may be created.

favogt
workaround:
openqa-client --host http://openqa.opensuse.org isos post DISTRI=opensuse VERSION=1 FLAVOR=poo62159 ARCH=x86_64 BUILD=1 HDD_1=openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 HDD_1_URL=http://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=1
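
The difference described in this comment can be illustrated with a small model. This is only a sketch under assumptions, not openQA's actual Perl code: `downloads_from_settings` is a hypothetical stand-in for the download-list logic, and deriving the asset name from the URL path when no explicit `HDD_1` is given is a simplification made for the example.

```python
from posixpath import basename
from urllib.parse import urlsplit


def downloads_from_settings(settings):
    """Collect (url, asset_name) pairs from keys ending in _URL.

    Simplified model: the asset name is the explicit setting (e.g. HDD_1)
    if present, otherwise the last path segment of the URL.
    """
    downloads = []
    for key, url in sorted(settings.items()):
        if not key.endswith("_URL"):
            continue
        asset_key = key[: -len("_URL")]
        name = settings.get(asset_key) or basename(urlsplit(url).path)
        downloads.append((url, name))
    return downloads


# Arguments given on the "isos post" command line (from the reproducer).
post_args = {"DISTRI": "opensuse", "VERSION": "1", "FLAVOR": "poo62159",
             "ARCH": "x86_64", "BUILD": "1"}

# Setting that only the test suite defines, so it is not part of the request.
suite_settings = {
    "HDD_1_URL": "http://download.opensuse.org/tumbleweed/appliances/"
                 "openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=1",
}

# The settings a finished job ends up with: request arguments plus suite settings.
job_settings = {**post_args, **suite_settings}

from_args = downloads_from_settings(post_args)    # list built from request arguments
from_job = downloads_from_settings(job_settings)  # list built from job settings
```

In this model the list built from the request arguments is empty, while the list built from the job's settings contains the qcow2 image, which matches the observation that cloning works and `isos post` does not.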

#19 Updated by okurz 5 months ago

Xiaojing_liu wrote:

I am not sure if we should call create_downloads_list for every job when doing 'isos post', because many jobs may be created.

I think this should be the same as calling jobs post … HDD_1_URL=http://download.opensuse.org/my/same/asset.qcow2 10 times. I suggest to just try this out and see what happens.

I recommend to just try out what happens when we call create_downloads_list.

favogt
workaround:
openqa-client --host http://openqa.opensuse.org isos post DISTRI=opensuse VERSION=1 FLAVOR=poo62159 ARCH=x86_64 BUILD=1 HDD_1=openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2 HDD_1_URL= http://download.opensuse.org/tumbleweed/appliances/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2?foobar=1

I think there are spaces between HDD_1_URL= and its arguments which should not be there.

#20 Updated by mkittler 5 months ago

  • Category changed from Concrete Bugs to Feature requests

@fvogt We came to the conclusion that this issue is actually: Add support for triggering GRU asset downloads via "isos post" when the relevant _URL parameters are not directly provided but only pulled from e.g. the test suites table

Is that right? I'm just wondering because in the ticket description this feature request is mixed up with restarting jobs and errors reported from the worker's asset cache.

#21 Updated by Xiaojing_liu 5 months ago

  • Category changed from Feature requests to Concrete Bugs

okurz wrote:

Xiaojing_liu wrote:

I am not sure if we should call create_downloads_list for every job when doing 'isos post', because many jobs may be created.

I think this should be the same as calling jobs post … HDD_1_URL=http://download.opensuse.org/my/same/asset.qcow2 10 times. I suggest to just try this out and see what happens.

I recommend to just try out what happens when we call create_downloads_list.

Here is the test result:
calling job post 10 times:

        # for i in $(seq 10); do openqa-cli api -X post jobs --host http://10.67.19.103 TEST=kde DISTRI=sle MACHINE=64bit; done

The job IDs are 186 ... 195.
We can see that 10 records were created in the database table gru_tasks:

id | taskname | args | run_at | priority | t_created | t_updated
5284 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:52 | 20 | 2020-08-11 10:31:52 | 2020-08-11 10:31:52
5283 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:52 | 20 | 2020-08-11 10:31:52 | 2020-08-11 10:31:52
5282 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:51 | 20 | 2020-08-11 10:31:51 | 2020-08-11 10:31:51
5281 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:51 | 20 | 2020-08-11 10:31:51 | 2020-08-11 10:31:51
5280 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:51 | 20 | 2020-08-11 10:31:51 | 2020-08-11 10:31:51
5279 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5278 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5277 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5276 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:50 | 20 | 2020-08-11 10:31:50 | 2020-08-11 10:31:50
5275 | download_asset | ["http:\/\/download.opensuse.org\/tumbleweed\/appliances\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2","\/var\/lib\/openqa\/share\/factory\/hdd\/openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2",0] | 2020-08-11 10:31:49 | 20 | 2020-08-11 10:31:49 | 2020-08-11 10:31:49

(10 rows)

And the gru_dependencies result is

job_id gru_task_id
195 5284
194 5283
193 5282
192 5281
191 5280
190 5279
189 5278
188 5277
187 5276
186 5275

(10 rows)

#22 Updated by Xiaojing_liu 5 months ago

  • Category changed from Concrete Bugs to Feature requests

#23 Updated by mkittler 5 months ago

Ok, so that would create multiple asset downloads. Judging by the code in openQA/lib/OpenQA/Task/Asset/Download.pm nothing bad will happen in that case. There's a lock to prevent concurrently downloading the same asset and a check to prevent downloading an existing asset again.

I suppose it would nevertheless be a good idea to de-duplicate the download lists for the jobs created by isos post by the download destination to produce less overhead. (E.g. enqueue_download_jobs would accept multiple download lists and skips "visited" download destinations.)
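
The de-duplication idea from this comment could look roughly like the following. This is a minimal sketch, not the merged change: `dedupe_downloads` and the `(url, destination)` tuple shape are hypothetical, chosen only to show "skip visited download destinations" across several per-job lists.

```python
def dedupe_downloads(download_lists):
    """Merge several per-job download lists, keeping one entry per destination."""
    visited = set()
    unique = []
    for downloads in download_lists:
        for url, destination in downloads:
            if destination in visited:
                continue  # another job already requested this asset
            visited.add(destination)
            unique.append((url, destination))
    return unique


url = ("http://download.opensuse.org/tumbleweed/appliances/"
       "openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2")
dest = ("/var/lib/openqa/share/factory/hdd/"
        "openSUSE-Tumbleweed-JeOS.x86_64-kvm-and-xen.qcow2")

# Ten jobs that all ask for the same asset, as in the test result above.
per_job_lists = [[(url, dest)] for _ in range(10)]
merged = dedupe_downloads(per_job_lists)
```

With ten identical requests, `merged` holds a single entry, so only one download task would need to be enqueued instead of ten.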

#24 Updated by Xiaojing_liu 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu

#25 Updated by favogt 5 months ago

mkittler wrote:

@fvogt We came to the conclusion that this issue is actually: Add support for triggering GRU asset downloads via "isos post" when the relevant _URL parameters are not directly provided but only pulled from e.g. the test suites table

Is that right? I'm just wondering because in the ticket description this feature request is mixed up with restarting jobs and errors reported from the worker's asset cache.

Yes. I don't see how there's any mixup, the errors and observations with job restarting are directly related.

#26 Updated by Xiaojing_liu 5 months ago

  • Status changed from In Progress to Feedback

PR has been merged

#27 Updated by favogt 5 months ago

  • Status changed from Feedback to Resolved

I can confirm that the issue is fixed, thanks!

I created a new test suite and hooked it up, but unfortunately it bumps against max_redirects now. I opened a PR for that: https://github.com/os-autoinst/openQA/pull/3338

#28 Updated by mkittler 5 months ago

There's just one caveat but for now I wouldn't over-optimize it.

#29 Updated by favogt 5 months ago

  • Status changed from Resolved to Workable

Unfortunately there is an issue with the way this is implemented.

I added a testsuite ("jeos2twnext") which defines HDD_1_URL and linked them to the JeOS medium in the "Development Tumbleweed" group.
When the next snapshot scheduled the JeOS product, all unrelated tests in the main group also failed with the GRU error:
https://openqa.opensuse.org/tests/1375735#step/GRU/1

So it appears like the download jobs are attached to all scheduled jobs and not just the ones which actually need them?

#30 Updated by Xiaojing_liu 5 months ago

favogt wrote:

Unfortunately there is an issue with the way this is implemented.

I added a testsuite ("jeos2twnext") which defines HDD_1_URL and linked them to the JeOS medium in the "Development Tumbleweed" group.
When the next snapshot scheduled the JeOS product, all unrelated tests in the main group also failed with the GRU error:
https://openqa.opensuse.org/tests/1375735#step/GRU/1

So it appears like the download jobs are attached to all scheduled jobs and not just the ones which actually need them?

Yes, when using isos post, the download jobs are attached to all scheduled jobs, even unrelated ones. This PR does not fix that; I could create a new ticket to track it.
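
To make the remaining problem concrete, here is a sketch of what "attach the download only to the jobs that need it" could mean. It is an illustration under assumptions and says nothing about how #70687 was actually implemented: `group_jobs_by_download`, the job IDs, the test names other than jeos2twnext, and the example URL are all hypothetical.

```python
from collections import defaultdict


def group_jobs_by_download(jobs):
    """Return {url: [job_id, ...]} for jobs whose own settings contain a *_URL key."""
    blocked = defaultdict(list)
    for job_id, settings in jobs.items():
        for key, value in settings.items():
            if key.endswith("_URL"):
                blocked[value].append(job_id)
    return dict(blocked)


# Hypothetical jobs scheduled for one product; only one defines HDD_1_URL.
jobs = {
    101: {"TEST": "jeos2twnext", "HDD_1_URL": "http://example.com/jeos.qcow2"},
    102: {"TEST": "kde"},
    103: {"TEST": "gnome"},
}
dependencies = group_jobs_by_download(jobs)
```

Here only job 101 would depend on the download task, so a failed download would not affect jobs 102 and 103.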

#31 Updated by Xiaojing_liu 5 months ago

  • Related to action #70687: Download gru is attached to all scheduled jobs when doing 'isos post' added

#32 Updated by Xiaojing_liu 5 months ago

  • Status changed from Workable to Feedback

#33 Updated by okurz 4 months ago

@Xiaojing_liu I guess by now we know that your change works and all changes left to be done are tracked in #70687, right? If this is true then please resolve this ticket.

#34 Updated by Xiaojing_liu 4 months ago

  • Status changed from Feedback to Resolved

#35 Updated by okurz 4 months ago

  • Related to action #72142: Avoid problematic symlinking in download assets tasks of the web UI added
