action #67288: test fails in partitioning in dual_windows10 - something must have been changed in openQA regarding the Windows 10 image or settings - openQA Tests - openSUSE Project Management Tool

action #67288

closed

test fails in partitioning in dual_windows10 - something must have been changed in openQA regarding the Windows 10 image or settings

Added by mlin7442 about 4 years ago. Updated about 4 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Bugs in existing tests

Target version:

Start date:

2020-05-26

Due date:

2020-06-17

% Done:

Estimated time:

Difficulty:

Description

Observation¶

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-kde_dual_windows10@uefi_win fails in
partitioning

I initially filed https://bugzilla.suse.com/show_bug.cgi?id=1172071, but I now suspect this is actually an openQA issue.

I went through the product changes in the last few check-ins and could not find any suspicious change. The same test also fails on Leap 15.2, so I believe something has changed in openQA regarding the Windows 10 image or settings. A further clue: I re-ran a previously successful openQA job and it now fails.

https://openqa.opensuse.org/tests/1278961 is a re-run of the previously successful job; it now fails.

Reproducible¶

Fails since (at least) Build 20200523

Expected result¶

Last good: 20200520 (or more recent)

Further details¶

Always latest result in this scenario: latest

Updated by mlin7442 about 4 years ago

Priority changed from Normal to High

Updated by riafarov about 4 years ago

The problem seems to be in the mechanism that hides the qcow2 from downloading. If I try to download the asset from openQA, I indeed get a 100 KB file, so it seems openQA cannot download it properly.
o3 contains the correct qcow2. However, the permissions were wrong (owner set to root). I've changed the ownership to geekotest, as it is supposed to be. Let's see if that helps, but I guess it's more complex than this.

Updated by mlin7442 about 4 years ago

It still does not seem to work: https://openqa.opensuse.org/tests/1291031

Updated by riafarov about 4 years ago

Can someone from the tools team comment on this?

Updated by dimstar about 4 years ago

openQA redirects download attempts for the Windows images to Microsoft, as we can't legally distribute those.

It redirects all users not coming from the worker network (i.e. not 192.168.112.0/24).
So far, all good.
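
To illustrate the rule described above, here is a minimal sketch of the "is this client inside the worker network?" decision. It is not how o3 actually implements the redirect (that is presumably done in the web server configuration); the function name `should_redirect` is hypothetical, and only the network 192.168.112.0/24 is taken from this comment.

```python
import ipaddress

# Worker network as quoted in the comment above.
WORKER_NETWORK = ipaddress.ip_network("192.168.112.0/24")


def should_redirect(client_ip: str) -> bool:
    """Return True if a download request for a Windows image should be
    redirected instead of being served from the local asset store.

    Hypothetical helper for illustration only.
    """
    return ipaddress.ip_address(client_ip) not in WORKER_NETWORK
```

A worker at 192.168.112.7 would get the real image, while any other address would be redirected. A client that expects a multi-gigabyte qcow2 and receives a small redirect page instead would match the roughly 100 KB file observed here, although that explanation is only a guess.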

Now, though, I did find a problem on ariel: the Windows qcow image exists in both factory/hdd AND factory/hdd/fixed.

in factory/hdd

-rw-r--r-- 1 geekotest nogroup     102228 May 25 13:16 windows-10-x86_64-1903@uefi_win.qcow2

in factory/hdd/fixed/

-rw-r--r-- 1 geekotest   nogroup  5450498048 Sep 26  2019 windows-10-x86_64-1903@uefi_win.qcow2

Clearly, the one in factory/hdd is not correct, but it seems to be preferred over the image in fixed. As a test, I renamed it to
windows-10-x86_64-1903@uefi_win.qcow2~ so that it is ignored for now.

Test run: https://openqa.opensuse.org/tests/1295279 -> passed partitioner

So it only remains to find out where this broken qcow image came from on May 25.
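
The situation found above (the same file name in factory/hdd and factory/hdd/fixed, with very different sizes) could be detected with a small script. The following is only a sketch: the function name `find_shadowing_assets` is hypothetical, and it assumes the directory layout described in this comment, with a `fixed` subdirectory directly below the hdd directory.

```python
from pathlib import Path


def find_shadowing_assets(hdd_dir):
    """Return (name, size_in_hdd, size_in_fixed) for every file that exists
    both in hdd_dir and in hdd_dir/fixed with a different size.

    Hypothetical helper; the 'fixed' subdirectory layout is taken from the
    comment above.
    """
    hdd = Path(hdd_dir)
    fixed = hdd / "fixed"
    conflicts = []
    if not fixed.is_dir():
        return conflicts
    for fixed_file in sorted(fixed.iterdir()):
        if not fixed_file.is_file():
            continue
        candidate = hdd / fixed_file.name
        if candidate.is_file():
            size_hdd = candidate.stat().st_size
            size_fixed = fixed_file.stat().st_size
            if size_hdd != size_fixed:
                conflicts.append((fixed_file.name, size_hdd, size_fixed))
    return conflicts
```

Run against the hdd directory on ariel before the rename, this would have reported windows-10-x86_64-1903@uefi_win.qcow2 with sizes 102228 and 5450498048.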

Updated by okurz about 4 years ago

Due date set to 2020-06-17
Status changed from New to Feedback
Assignee set to okurz

It is unlikely that we can find out what caused this. Looking in the database I can find:

openqa=> select jobs.id,t_finished,test from jobs,job_settings where (jobs.test ~ 'windows' and job_settings.job_id = jobs.id and key = 'PUBLISH_HDD_1' and value = 'windows-10-x86_64-1903@uefi_win.qcow2');
   id    |     t_finished      |    test    
---------+---------------------+------------
 1036580 | 2019-09-20 10:57:15 | windows_10
(1 row)

So there is a single job, but it is much older, about the age of the actual fixed asset. Also, https://openqa.opensuse.org/tests/1036580/file/worker-log.txt shows what looks like a "longer" upload corresponding to a file that is way bigger than 100 KB. So I guess someone made a mistake, triggered one job, maybe aborted it prematurely, etc. Maybe we can just regard it as unlucky timing that caused it to end up in a way that is not completely obvious :D

In hindsight, the wrong permissions might also be a symptom of a "prematurely aborted upload", as it might be that in the correct case the ownership of the file should change to geekotest. But it could also be someone doing stuff manually. Overall the story looks related to #67219.

So I think the immediate problem is fixed. I will take the ticket and try to use the opportunity for all of us involved to learn and see how we can improve in the future, maybe not to prevent cases like these, but so that the next time we spend less time and effort identifying the root cause.

I have one finding: https://openqa.opensuse.org/tests/1277483 is the first job in the row that failed. maxlin reviewed it and reported the bug on bugzilla. What could have helped is the initial investigation bisection step to distinguish "1. is it reproducible, 2. does the same test with test code of 'last good' still work, 3. does the same test with product state of 'last good' still work.". https://gitlab.suse.de/openqa/auto-review/pipelines is set up for that by triggering automatic investigation jobs for every new failure that does not yet have a comment. There was however unfortunate timing, as the pipeline triggers every day at 08:19 CET and maxlin commented at just 07:59 CET, so 20 minutes before :D The specific review job in question is https://gitlab.suse.de/openqa/auto-review/-/jobs/210675
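
The three bisection questions above can be read as a small decision table. The sketch below only illustrates how the outcomes of the three investigation jobs might be interpreted; it is not part of openqa-investigate or the auto-review pipeline, and the function name and the result labels are hypothetical.

```python
def classify_failure(retry_passes, last_good_tests_pass, last_good_build_passes):
    """Interpret the results of the three investigation jobs.

    retry_passes:           the unchanged job passes when retriggered
    last_good_tests_pass:   job passes with the test code of 'last good'
    last_good_build_passes: job passes with the product state of 'last good'

    Hypothetical helper; the labels are not openQA terminology. If both the
    'last good' test code and the 'last good' build pass, the test code is
    reported first, which is an arbitrary choice of this sketch.
    """
    if retry_passes:
        return "sporadic"
    if last_good_tests_pass:
        return "test regression"
    if last_good_build_passes:
        return "product regression"
    # Nothing that used to work still works: look outside tests and product.
    return "environment or infrastructure"
```

In this ticket, the re-run of a previously passing job failed as well, which corresponds to the last branch and points away from both test and product changes.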

Hence I have one simple suggestion: Use https://github.com/os-autoinst/scripts/blob/master/openqa-investigate for any new openQA test failures where the root cause is not immediately obvious

I am looking forward to more comments from all of you.

Updated by okurz about 4 years ago

Status changed from Feedback to Resolved

I think the "investigation" route provided by openQA, same as the automatically triggered investigation jobs, would at least show that there is no relevant difference, which should lead one to the conclusion that the cause is neither test differences nor product differences. Adding a checksum sounds feasible, same as cross-checking the size of the image. IMHO we should calculate, check and show the checksum of generated/used assets, especially for "fixed" assets. Recorded the idea in #65271#note-19
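
As a rough illustration of the checksum and size cross-check proposed above, the following sketch computes both values for an asset and compares them with previously recorded ones. This is not existing openQA functionality; the function names are hypothetical, and SHA-256 is an assumption, as the ticket does not name an algorithm.

```python
import hashlib


def describe_asset(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, sha256_hex) of a file, reading it in chunks so
    that multi-gigabyte images do not have to fit in memory."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def asset_matches(path, expected_size, expected_sha256):
    """Return True only if both size and checksum match the recorded values."""
    size, sha256_hex = describe_asset(path)
    return size == expected_size and sha256_hex == expected_sha256
```

With the size and checksum of the "fixed" image recorded once, a 102228-byte file under the same name would fail this check immediately.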
