action #120193
closed[tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4"
100%
Description
Observation
This issue is observed quite often, but not daily.
Some tests get started before the NET media and repo are synced and registered. Without the two available, zypper refresh fails in all tests (tests explicitly switch to the QArepo).
Waiting for the full schedule to be ready and then rerunning the tests makes them pass (but I should not have to do that).
openQA test in scenario opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-jeos-ltp-commands@uefi_virtio-2G fails in zypper_ref
Steps to reproduce
Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , call
openqa-query-for-job-label poo#120193
Test suite description
backup: LTP_COMMAND_EXCLUDE=tar01_sh|logrotate_sh|unzip01_sh|df01_._sh|sysctl01_sh|mkfs01._sh|which01_sh|insmod01_sh
Reproducible
Fails since (at least) Build 20221109 (current job)
Expected result
Last good: 20221108 (or more recent)
Suggestions
- Look into https://progress.opensuse.org/issues/120193#note-11 and understand why https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 apparently could not fix the problem
- Discuss with Andrii Nikitin and DimStar what to do. At best ask Andrii to fix it :)
- Remind them about the suggestions in https://progress.opensuse.org/issues/120193#note-4
- Alternative: Find a better approach within openQA
Further details
Always latest result in this scenario: latest
Workaround
Retrigger the failing tests as the repos will likely have finished syncing necessary assets in the second run
Updated by maritawerner almost 2 years ago
- Subject changed from Test schedule ordering: tests run before repo available to [qe-core] Test schedule ordering: tests run before repo available
Updated by okurz almost 2 years ago
- Subject changed from [qe-core] Test schedule ordering: tests run before repo available to [tools] Test schedule ordering: tests run before repo available
- Category changed from Bugs in existing tests to Infrastructure
- Status changed from New to In Progress
- Assignee set to okurz
- Priority changed from Normal to High
- Target version set to Ready
I will check if assets are correctly specified and the quota is big enough
Updated by okurz almost 2 years ago
Ok. Tests pass after retriggering, so eventually the assets are synced and they also stay long enough to execute tests, so the problem is not that quotas are too small. My hypothesis is that either assets are not correctly specified, so that jobs don't have an explicit relation to the assets, or that syncing and triggering within openqa-trigger-from-obs is not atomic, or at least not happening in the right order for the medium.
Updated by okurz almost 2 years ago
- Status changed from In Progress to Feedback
As visible on https://openqa.opensuse.org/admin/obs_rsync/openSUSE:Factory:ToTest the syncs that happened are:
openSUSE:Factory:ToTest|base 221206_061932_20221205
openSUSE:Factory:ToTest|jeos 221206_061801_20221205
where "…|base" includes the repo sync and "…|jeos" apparently relies on it. However, the jeos sync concludes before the base sync does, so the jeos tests might start and reach a point where they need the repos before the repos have finished syncing, and hence fail.
https://openqa.opensuse.org/minion/jobs?id=1992244 shows that the above timestamp is actually the starting time. The syncing finished at 2022-12-06T06:26:00.054607Z, so there was an 8m window in which tests could have started while the repo was not yet ready. The last failure seems to be from yesterday. Likely the corresponding minion job executing the sync of repos was https://openqa.opensuse.org/minion/jobs?id=1988493 showing a duration of 19m until 00:29:00Z when the repo sync was done. But a jeos test https://openqa.opensuse.org/tests/2927376/logfile?filename=autoinst-log.txt already failed at 00:11:31Z trying to access repos which were not completely synced at that time and maybe had not even started syncing yet.
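For reference, the 8m window can be recomputed from the timestamps quoted above. This is only a small arithmetic sketch: it assumes that the run names start with a UTC timestamp in YYMMDD_HHMMSS form and that "yesterday" means 2022-12-05. The helper name parse_sync_start is made up for illustration and is not part of openqa-trigger-from-obs.

```python
from datetime import datetime, timezone


def parse_sync_start(run_name: str) -> datetime:
    """Parse the leading 'YYMMDD_HHMMSS' of a run name (assumed to be UTC)."""
    date_part, time_part = run_name.split("_")[:2]
    parsed = datetime.strptime(date_part + time_part, "%y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Values quoted in this ticket.
jeos_started = parse_sync_start("221206_061801_20221205")
base_started = parse_sync_start("221206_061932_20221205")
base_finished = datetime.fromisoformat("2022-12-06T06:26:00.054607+00:00")

# Time during which jeos tests could already run while repos were still syncing.
window = base_finished - jeos_started
print("risk window in minutes:", round(window.total_seconds() / 60))

# Earlier failure: the test died at 00:11:31Z, the repo sync was done at 00:29:00Z.
# The calendar date is an assumption and does not affect the difference.
day = datetime(2022, 12, 5, tzinfo=timezone.utc)
test_failed = day.replace(hour=0, minute=11, second=31)
repo_sync_done = day.replace(hour=0, minute=29)
too_early_by = repo_sync_done - test_failed
print("test failed before sync end by:", too_early_by)
```

With a sync duration of 19m ending at 00:29:00Z, the repo sync would have started around 00:10:00Z, so that test failed shortly after the sync began.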
I assume the problem stems from the "batch" nodes in https://github.com/os-autoinst/openqa-trigger-from-obs/blob/master/xml/obs/openSUSE:Factory.xml not being guaranteed to be executed in the order in which they appear in the file. I also don't think we guarantee any order of execution, and definitely not order of completion, in minion jobs. So I see the following options:
- Specify the necessary repo settings as part of each "batch" node that relies on them. Disadvantage: hard-to-maintain duplication
- Combine the multiple flavor-nodes into one combined batch section if possible
- Ensure that sync&trigger is serialized and executed in the order specified in the xml files
- Artificially pause&delay the tests relying on the repos until all data can be found. Advantage: this can be done in os-autoinst-distri-opensuse without needing more knowledge about openqa-trigger-from-obs. Disadvantage: the approach would be a workaround at a single place that might need to be applied at other places as well, whereas the actual problem comes from openqa-trigger-from-obs
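To illustrate the last option, here is a minimal, generic sketch of "wait until the repo is ready, otherwise give up after a timeout". It is not code from os-autoinst-distri-opensuse or openqa-trigger-from-obs. The function wait_until and the is_ready callback are hypothetical names, and how readiness would actually be detected (e.g. a successful zypper refresh or a marker file) is left open.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the resource became ready, False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake readiness check: the "repo" is ready on the third poll.
answers = iter([False, False, True])
became_ready = wait_until(lambda: next(answers), timeout_s=60, interval_s=0)
print("repo ready:", became_ready)

# Demo of the timeout path: never ready and a zero timeout.
timed_out = not wait_until(lambda: False, timeout_s=0, interval_s=0)
print("timed out:", timed_out)
```

Such a loop only hides the race in one place. The ordering problem in sync&trigger stays unsolved, which is the disadvantage named in the list.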
Asking others for help in https://suse.slack.com/archives/C02CANHLANP/p1670310778419839
Updated by okurz almost 2 years ago
- Due date set to 2022-12-20
https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 from anikitin. I will monitor over the next days if it helps.
Updated by okurz almost 2 years ago
- Subject changed from [tools] Test schedule ordering: tests run before repo available to [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry
- Description updated (diff)
Seems like no related issues showed up today. Will monitor using auto-review over the next days.
Updated by livdywan almost 2 years ago
- Subject changed from [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry to [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry size:M
Updated by livdywan almost 2 years ago
okurz wrote:
https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 from anikitin. I will monitor over the next days if it helps.
Unfortunately it looks like we're still getting hits on openqa-query-for-job-label poo#120193
e.g. https://openqa.opensuse.org/tests/2963108#step/zypper_ref/61 or https://openqa.opensuse.org/tests/2963155#step/zypper_ref/61
Updated by livdywan almost 2 years ago
- Due date changed from 2022-12-20 to 2023-01-06
Let's revisit the remaining issues in the new year.
Updated by okurz over 1 year ago
- Due date changed from 2023-01-06 to 2023-01-20
Christmas grace due date bump :)
Updated by okurz over 1 year ago
$ openqa-query-for-job-label poo#120193
2970888|2022-12-22 12:17:15|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_tests_and_build:4666ea5fd861b1096ce31d0ab27eb26405e52a79+343.1||openqaworker20
2970836|2022-12-22 12:00:13|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_build:343.1||openqaworker19
2970654|2022-12-22 10:46:57|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_tests_and_build:4666ea5fd861b1096ce31d0ab27eb26405e52a79+343.1||openqaworker4
2970665|2022-12-22 10:30:35|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_build:343.1||openqaworker19
2970626|2022-12-22 09:20:01|done|failed|gnome:investigate:last_good_tests_and_build:8ddff33b87479593b1b940f7ec23f9bb9ea1e030+20221220||qa-power8-3
2970563|2022-12-22 09:01:55|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_build:343.1||openqaworker1
2970605|2022-12-22 08:39:19|done|failed|gnome:investigate:last_good_tests_and_build:8ddff33b87479593b1b940f7ec23f9bb9ea1e030+20221220||qa-power8-3
2970598|2022-12-22 08:34:36|done|failed|gnome:investigate:last_good_build:20221220||qa-power8-3
2970414|2022-12-22 07:55:11|done|failed|gnome:investigate:last_good_tests_and_build:8ddff33b87479593b1b940f7ec23f9bb9ea1e030+20221220||qa-power8-3
2970413|2022-12-22 07:52:17|done|failed|gnome:investigate:last_good_build:20221220||qa-power8-3
10229590|2022-12-22 13:15:48|done|failed|slem_containers_selinux||openqaworker-arm-3
10229587|2022-12-22 13:12:13|done|failed|slem_containers||openqaworker-arm-1
10229554|2022-12-22 12:49:19|done|failed|slem_containers_selinux||openqaworker-arm-2
10229551|2022-12-22 12:48:20|done|failed|slem_selinux||openqaworker-arm-3
10229570|2022-12-22 12:43:41|done|failed|slem_virtualization||openqaworker-arm-1
10229540|2022-12-22 12:36:43|done|failed|slem_containers||openqaworker-arm-1
10229539|2022-12-22 12:34:09|done|failed|slem_image_default||openqaworker-arm-2
10229538|2022-12-22 12:31:56|done|failed|slem_migration_5.1_to_5.2||openqaworker-arm-2
10229537|2022-12-22 12:30:49|done|failed|slem_virtualization:investigate:bisect_without_27245||openqaworker-arm-1
10229569|2022-12-22 11:53:01|done|failed|slem_installation_default||worker2
@Andrii Nikitin how do you understand https://openqa.opensuse.org/tests/2970888#step/zypper_ref/73 and do you think it can be related to #120193, which was supposedly fixed by https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 ?
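To make a listing like the one above easier to triage, the rows can be split into investigation jobs and regular jobs. This is a rough sketch: the column meanings (job id, timestamp, state, result, test name, an empty column, worker) are guessed from the output and not taken from the script's documentation, and parse_line is a made-up helper.

```python
from collections import Counter
from typing import NamedTuple


class LabelledJob(NamedTuple):
    job_id: int
    timestamp: str
    state: str
    result: str
    test: str
    worker: str


def parse_line(line: str) -> LabelledJob:
    """Split one pipe-separated row; the sixth column is empty in the sample."""
    job_id, timestamp, state, result, test, _empty, worker = line.split("|")
    return LabelledJob(int(job_id), timestamp, state, result, test, worker)


# A few rows copied from the output above.
sample = """\
2970888|2022-12-22 12:17:15|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_tests_and_build:4666ea5fd861b1096ce31d0ab27eb26405e52a79+343.1||openqaworker20
2970598|2022-12-22 08:34:36|done|failed|gnome:investigate:last_good_build:20221220||qa-power8-3
10229590|2022-12-22 13:15:48|done|failed|slem_containers_selinux||openqaworker-arm-3
10229569|2022-12-22 11:53:01|done|failed|slem_installation_default||worker2
"""

jobs = [parse_line(row) for row in sample.splitlines() if row.strip()]

# Investigation jobs carry ':investigate:' in their test name.
by_kind = Counter(
    "investigate" if ":investigate:" in job.test else "regular" for job in jobs
)
print(dict(by_kind))
```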
Updated by okurz over 1 year ago
- Subject changed from [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry size:M to [tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4" size:M
- Due date deleted (2023-01-20)
- Status changed from Feedback to Resolved
The first ones are all Leap investigation builds trying to access the former build of Leap, which maybe was already cleaned up by the asset cleanup. That would mean that simple cleanup deleted intermediate and older repos despite https://openqa.opensuse.org/group_overview/50 being configured for 1.2TB of asset size. I have now bumped that to a slightly higher 1.4TB. And OSD tests like https://openqa.suse.de/tests/10229569#step/install_updates/90 fail due to a different problem with certificates which someone moved to a Confluence page (!), related thread https://suse.slack.com/archives/C029APBKLGK/p1671712571096889. This means that we have not found any more recent problems for this exact ticket, so I am considering it resolved.
Updated by dimstar over 1 year ago
- Status changed from Resolved to In Progress
I disagree - seems your script does not find anything which a human might have restarted already?
From today's snapshot:
- https://openqa.opensuse.org/tests/2969502#step/zypper_ref/73
- https://openqa.opensuse.org/tests/2969503#step/zypper_ref/73
- https://openqa.opensuse.org/tests/2969504#step/zypper_ref/73
- https://openqa.opensuse.org/tests/2969507#step/zypper_ref/73
- https://openqa.opensuse.org/tests/2969505#step/zypper_ref/73
- https://openqa.opensuse.org/tests/2969506#step/zypper_ref/73
And I'm sure there were more
Updated by okurz over 1 year ago
- Subject changed from [tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4" size:M to [tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4"
- Status changed from In Progress to New
- Assignee deleted (okurz)
Ok, that looks like still the original problem of tests being triggered before the snapshot repos were synced, right?
Updated by dimstar over 1 year ago
okurz wrote:
Ok, that looks like still the original problem of tests being triggered before the snapshot repos were synced, right?
Indeed; retriggering the tests later (after all jobs appeared) made them pass without problems
Updated by okurz over 1 year ago
- Description updated (diff)
- Assignee set to andriinikitin
- Target version changed from Ready to future
@andriinikitin can you please
- Look into #120193#note-11 and understand why https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 apparently could not fix the problem
- Look into the suggestions in #120193#note-4
Updated by andriinikitin over 1 year ago
okurz wrote:
@andriinikitin can you please
- Look into #120193#note-11 and understand why https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 apparently could not fix the problem
- Look into the suggestions in #120193#note-4
I guess this discussion on slack got lost during holidays https://suse.slack.com/archives/C02CANHLANP/p1671715130965669
Summary: build 343.1 was 21 days old at the moment of failure and was probably cleaned up by the asset management job. Previous runs of that test with that build were successful, so the failure isn't related to the problem that https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 was addressing.
Updated by andriinikitin over 1 year ago
Further improve consistency checks with https://github.com/os-autoinst/openqa-trigger-from-obs/pull/196 .
It is tricky to write a proper test for it, but it should help this time.
Updated by slo-gin over 1 year ago
This ticket was set to High priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
Updated by andriinikitin over 1 year ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
I assume it is resolved by the commit mentioned in the last note