test fails in partitioning in dual_windows10 - something must have been changed in openQA regarding windows10 image or settings
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-kde_dual_windows10@uefi_win fails in
I filed https://bugzilla.suse.com/show_bug.cgi?id=1172071 in the first place, but I'm afraid this is actually an openQA issue.
I've gone through the product changes in the last few check-ins and cannot find any suspicious change. At the same time this test also fails on Leap 15.2, so I believe something has changed in openQA regarding the Windows 10 image or settings. A further clue is that I re-triggered a previously successful openQA job and it now fails.
https://openqa.opensuse.org/tests/1278961 is a re-run of the previously successful job; it now fails.
Fails since (at least) Build 20200523
Last good: 20200520 (or more recent)
Always latest result in this scenario: latest
#1 Updated by mlin7442 about 3 years ago
- Priority changed from Normal to High
#2 Updated by riafarov almost 3 years ago
The problem seems to be in the mechanism that hides the qcow2 from downloading. If I try to download the asset from openQA, it is indeed a 100KB file. So it seems that openQA cannot download it properly.
o3 contains the correct qcow2. However, the ownership was wrong (set to root). I've changed it to geekotest, as it's supposed to be. Let's see if that helps, but I guess it's more complex than this.
#3 Updated by mlin7442 almost 3 years ago
It still doesn't seem to work: https://openqa.opensuse.org/tests/1291031
#4 Updated by riafarov almost 3 years ago
Can someone from tools team comment on this?
#5 Updated by dimstar almost 3 years ago
openQA redirects download attempts of the win images to microsoft, as we can't legally distribute those.
It redirects all users not coming from the worker network (i.e. not 192.168.112.0/24).
So far, all good.
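The redirect rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not openQA's actual implementation: the function name `should_redirect` is hypothetical, and only the network 192.168.112.0/24 is taken from this comment.

```python
import ipaddress

# Worker network as quoted in this ticket.
WORKER_NETWORK = ipaddress.ip_network("192.168.112.0/24")


def should_redirect(client_ip: str) -> bool:
    """Return True if a download request should be redirected.

    Hypothetical helper: clients outside the worker network are
    redirected, clients inside it would get the real asset.
    """
    return ipaddress.ip_address(client_ip) not in WORKER_NETWORK


if __name__ == "__main__":
    # One address inside the worker network, one outside (documentation range).
    for ip in ("192.168.112.7", "203.0.113.10"):
        print(ip, "redirect" if should_redirect(ip) else "serve asset")
```

With a rule like this, a download from a developer machine never returns the real image, which matches the small file seen when fetching the asset from outside.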
Now, though, I did find a problem on ariel: the win qcow image exists in factory/hdd AND factory/hdd/fixed
-rw-r--r-- 1 geekotest nogroup 102228 May 25 13:16 windows-10-x86_64-1903@uefi_win.qcow2
-rw-r--r-- 1 geekotest nogroup 5450498048 Sep 26 2019 windows-10-x86_64-1903@uefi_win.qcow2
Clearly, the one in factory/hdd is not correct, but it seems to be preferred over the image in fixed. As a test, I renamed it to windows-10-x86_64-1903@uefi_win.qcow2~ so that it is ignored for now.
Test run: https://openqa.opensuse.org/tests/1295279 -> passed partitioner
So it only remains to find out where this broken qcow image came from on May 25.
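A check for this kind of duplicate could look like the following Python sketch. It assumes the layout seen here (a directory with a `fixed` subdirectory next to the regular assets); the function name `find_shadowed_assets` is hypothetical and the directory is passed in by the caller rather than assumed.

```python
import sys
from pathlib import Path


def find_shadowed_assets(hdd_dir):
    """List assets that exist both in hdd_dir and in hdd_dir/fixed.

    Returns (name, size_in_hdd, size_in_fixed) tuples, sorted by name.
    """
    hdd = Path(hdd_dir)
    fixed = hdd / "fixed"
    shadowed = []
    if not fixed.is_dir():
        return shadowed
    for fixed_file in sorted(fixed.iterdir()):
        candidate = hdd / fixed_file.name
        if fixed_file.is_file() and candidate.is_file():
            shadowed.append(
                (fixed_file.name, candidate.stat().st_size, fixed_file.stat().st_size)
            )
    return shadowed


if __name__ == "__main__" and len(sys.argv) > 1:
    # The asset directory is given on the command line.
    for name, size_hdd, size_fixed in find_shadowed_assets(sys.argv[1]):
        note = "  <-- sizes differ" if size_hdd != size_fixed else ""
        print(f"{name}: hdd={size_hdd} fixed={size_fixed}{note}")
```

For the two files listed above, such a check would report the same name twice with sizes 102228 and 5450498048, which makes the broken copy easy to spot.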
#6 Updated by okurz almost 3 years ago
- Due date set to 2020-06-17
- Status changed from New to Feedback
- Assignee set to okurz
It is unlikely that we can find out what caused this. Looking in the database I can find:
openqa=> select jobs.id,t_finished,test from jobs,job_settings where (jobs.test ~ 'windows' and job_settings.job_id = jobs.id and key = 'PUBLISH_HDD_1' and value = 'windows-10-x86_64-1903@uefi_win.qcow2');
   id    |     t_finished      |    test
---------+---------------------+------------
 1036580 | 2019-09-20 10:57:15 | windows_10
(1 row)
So there is a single job, but it is much older, about the age of the actual fixed asset. Also, https://openqa.opensuse.org/tests/1036580/file/worker-log.txt shows what looks like a "longer" upload corresponding to a file that is way bigger than 100KB. So I guess someone made a mistake, triggered one job, maybe aborted it prematurely, etc. Maybe we can just regard it as unlucky timing that caused it to end up in a way that is not completely obvious :D
In hindsight the wrong ownership might also be a symptom of a "prematurely aborted upload", as it might be that in the correct case the file's ownership should change to geekotest. But it could also be someone doing stuff manually. Overall the story looks related to #67219.
So I think the immediate problem is fixed. I will take the ticket and try to use the opportunity for all of us involved to learn and see how we can improve in the future, perhaps not to prevent cases like these, but so that next time we spend less time and effort identifying the root cause.
I have one finding: https://openqa.opensuse.org/tests/1277483 is the first job in the series that failed. maxlin reviewed it and reported the bug on bugzilla. What could have helped is the initial investigation bisection step to distinguish "1. is it reproducible, 2. does the same test with test code of 'last good' still work, 3. does the same test with product state of 'last good' still work". https://gitlab.suse.de/openqa/auto-review/pipelines is set up for that: it triggers automatic investigation jobs for every new failure that does not yet have a comment. There was however unfortunate timing, as the pipeline triggers every day at 0819 CET and maxlin commented at 0759 CET, so 20 minutes before :D The specific review job in question is https://gitlab.suse.de/openqa/auto-review/-/jobs/210675
Hence I have one simple suggestion: Use https://github.com/os-autoinst/scripts/blob/master/openqa-investigate for any new openQA test failures where the root cause is not immediately obvious
I am looking forward to more comments from all of you.
#7 Updated by okurz almost 3 years ago
- Status changed from Feedback to Resolved
I think the "investigation" route provided by openQA, as well as the automatically triggered investigation jobs, would at least show that there is no relevant difference, which should lead one to the conclusion that neither test differences nor product differences are the cause. Adding a checksum sounds feasible, as does crosschecking the size of the image. IMHO we should calculate, check and show the checksum of generated/used assets, especially for "fixed" assets. Recorded the idea in #65271#note-19
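As a rough illustration of the checksum and size crosscheck idea, here is a minimal Python sketch. It is not an existing openQA feature; the helper names `sha256_of` and `verify_asset` are hypothetical, and where the expected size and checksum would be stored is left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 of a file in chunks, so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_asset(path, expected_size, expected_sha256):
    """Return a list of problems; an empty list means the asset matches."""
    problems = []
    actual_size = Path(path).stat().st_size
    if actual_size != expected_size:
        # The size check is cheap, so skip hashing when it already fails.
        problems.append(f"size {actual_size} != expected {expected_size}")
        return problems
    actual_sha256 = sha256_of(path)
    if actual_sha256 != expected_sha256:
        problems.append(f"sha256 {actual_sha256} != expected {expected_sha256}")
    return problems
```

Checking the size first would already have caught the case in this ticket, since the broken file was several orders of magnitude smaller than the fixed asset.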