action #120193

closed

[tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4"

Added by dimstar about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Infrastructure
Target version:
Start date: 2022-11-09
Due date:
% Done: 100%
Estimated time:
Difficulty:

Description

Observation

This issue is observed quite often, but not daily.

Some tests get started before the NET media and repo are synced and registered. Without the two available, zypper refresh fails in all tests (tests explicitly switch to the QArepo).

Waiting for the full schedule to be ready and then rerunning the tests makes them pass (but I should not have to do that).

openQA test in scenario opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-jeos-ltp-commands@uefi_virtio-2G fails in
zypper_ref

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
openqa-query-for-job-label poo#120193

Test suite description

backup: LTP_COMMAND_EXCLUDE=tar01_sh|logrotate_sh|unzip01_sh|df01_._sh|sysctl01_sh|mkfs01._sh|which01_sh|insmod01_sh

Reproducible

Fails since (at least) Build 20221109 (current job)

Expected result

Last good: 20221108 (or more recent)

Suggestions

Further details

Always latest result in this scenario: latest

Workaround

Retrigger the failing tests; by the second run the repos will likely have finished syncing the necessary assets.

Actions #1

Updated by maritawerner about 2 years ago

  • Subject changed from Test schedule ordering: tests run before repo available to [qe-core] Test schedule ordering: tests run before repo available
Actions #2

Updated by okurz about 2 years ago

  • Subject changed from [qe-core] Test schedule ordering: tests run before repo available to [tools] Test schedule ordering: tests run before repo available
  • Category changed from Bugs in existing tests to Infrastructure
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

I will check if assets are correctly specified and the quota is big enough

Actions #3

Updated by okurz about 2 years ago

Ok. Tests pass after retriggering, so the assets are eventually synced and they also stay around long enough to execute tests, which means the problem is not a quota that is too small. My hypothesis is that either the assets are not correctly specified, so that jobs don't have an explicit relation to them, or that syncing and triggering within openqa-trigger-from-obs is not atomic, or at least does not happen in the right order for the medium.

Actions #4

Updated by okurz about 2 years ago

  • Status changed from In Progress to Feedback

As visible on https://openqa.opensuse.org/admin/obs_rsync/openSUSE:Factory:ToTest the syncs that happened are:
openSUSE:Factory:ToTest|base 221206_061932_20221205
openSUSE:Factory:ToTest|jeos 221206_061801_20221205

Here "…|base" includes the repo sync and "…|jeos" apparently relies on it. However, the jeos sync concludes before the base sync does, so the jeos tests can start and reach a point where they need the repos before the repos have finished syncing, and hence fail.

https://openqa.opensuse.org/minion/jobs?id=1992244 shows that the above timestamp is actually the starting time. The syncing finished at 2022-12-06T06:26:00.054607Z, so there was a window of about 8 minutes in which tests could have started while the repo was not yet ready. The last failure seems to be from yesterday. Likely the corresponding minion job executing the sync of repos was https://openqa.opensuse.org/minion/jobs?id=1988493 showing a duration of 19m until 00:29:00Z when the repo sync was done. But a jeos test https://openqa.opensuse.org/tests/2927376/logfile?filename=autoinst-log.txt already failed at 00:11:31Z trying to access repos which were not completely synced at that time and may not even have started syncing yet.
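
To make the timing concrete, the windows can be recomputed from the timestamps quoted above. This is only a sketch: it assumes the sync names start with a UTC `YYMMDD_HHMMSS` stamp and that the "8m" window is counted from the start of the jeos sync. The helper name `parse_sync_stamp` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_sync_stamp(stamp: str) -> datetime:
    """Parse the leading 'YYMMDD_HHMMSS' part of a sync name (assumed to be UTC)."""
    return datetime.strptime(stamp[:13], "%y%m%d_%H%M%S").replace(tzinfo=timezone.utc)


# Values copied from the obs_rsync listing and the minion job quoted above.
jeos_started = parse_sync_stamp("221206_061801_20221205")
base_started = parse_sync_stamp("221206_061932_20221205")
# Same instant as 2022-12-06T06:26:00.054607Z, written with an explicit offset.
base_finished = datetime.fromisoformat("2022-12-06T06:26:00.054607+00:00")

# Window counted from the jeos sync start (assumption, matches the "8m" above).
window_from_jeos = base_finished - jeos_started
# Window counted from the base sync start, for comparison.
window_from_base = base_finished - base_started

# Earlier incident, as offsets from midnight UTC: the repo sync ran for 19m
# and was done at 00:29:00Z, while the jeos test failed at 00:11:31Z.
sync_done = timedelta(minutes=29)
sync_started = sync_done - timedelta(minutes=19)
test_failed = timedelta(minutes=11, seconds=31)

print("window from jeos sync start:", window_from_jeos)
print("window from base sync start:", window_from_base)
print("test failed after sync start:", test_failed - sync_started)
print("test failed before sync end:", sync_done - test_failed)
```

Under these assumptions the jeos test in the earlier incident failed roughly a minute and a half after the repo sync started and well before it finished.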

I assume the problem stems from the "batch" nodes in https://github.com/os-autoinst/openqa-trigger-from-obs/blob/master/xml/obs/openSUSE:Factory.xml not being guaranteed to be executed in the order in which they appear in the file. Also, I don't think we guarantee any order of execution, and definitely not any order of completion, for minion jobs. So I see the following options:

  1. Specify the necessary repo settings as part of each "batch" node that relies on them. Disadvantage: duplication that is hard to maintain
  2. Combine the multiple flavor nodes into one combined batch section if possible
  3. Ensure that sync and trigger are serialized and executed in the order specified in the xml files
  4. Artificially pause and delay the tests relying on the repos until all data can be found. Advantage: this can be done in os-autoinst-distri-opensuse without needing more knowledge about openqa-trigger-from-obs. Disadvantage: the approach would be a workaround in a single place that may need to be applied in other places as well, whereas the actual problem comes from openqa-trigger-from-obs
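
Option 4 could look roughly like the following polling loop. This is a minimal sketch and not code from os-autoinst-distri-opensuse or openqa-trigger-from-obs: `wait_for_repo`, the `repo_ready` callback and the default timeout and interval are all hypothetical, and how readiness is actually detected (for example a successful repo refresh) is left to the caller.

```python
import time
from typing import Callable


def wait_for_repo(
    repo_ready: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll repo_ready() until it returns True or timeout_s has elapsed.

    Returns True as soon as the repo is reported ready and False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if repo_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical readiness check: pretend the repo shows up on the third poll.
    attempts = {"n": 0}

    def fake_ready() -> bool:
        attempts["n"] += 1
        return attempts["n"] >= 3

    ok = wait_for_repo(fake_ready, timeout_s=10, interval_s=0, sleep=lambda _s: None)
    print("repo ready:", ok, "after", attempts["n"], "polls")
```

A wait like this only hides the ordering problem on the test side, which is the disadvantage named in option 4.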

Asking others for help in https://suse.slack.com/archives/C02CANHLANP/p1670310778419839

Actions #5

Updated by okurz about 2 years ago

  • Due date set to 2022-12-20

https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 from anikitin. I will monitor over the next days if it helps.

Actions #6

Updated by okurz about 2 years ago

  • Subject changed from [tools] Test schedule ordering: tests run before repo available to [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry
  • Description updated (diff)

Seems like no related issues showed up today. Will monitor using auto-review over the next days.

Actions #7

Updated by livdywan about 2 years ago

  • Subject changed from [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry to [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry size:M
Actions #8

Updated by livdywan about 2 years ago

okurz wrote:

https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 from anikitin. I will monitor over the next days if it helps.

Unfortunately it looks like we're still getting hits on openqa-query-for-job-label poo#120193 e.g. https://openqa.opensuse.org/tests/2963108#step/zypper_ref/61 or https://openqa.opensuse.org/tests/2963155#step/zypper_ref/61

Actions #9

Updated by livdywan about 2 years ago

  • Due date changed from 2022-12-20 to 2023-01-06

Let's revisit the remaining issues in the new year.

Actions #10

Updated by okurz about 2 years ago

  • Due date changed from 2023-01-06 to 2023-01-20

christmas grace due date bump :)

Actions #11

Updated by okurz about 2 years ago

$ openqa-query-for-job-label poo#120193
2970888|2022-12-22 12:17:15|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_tests_and_build:4666ea5fd861b1096ce31d0ab27eb26405e52a79+343.1||openqaworker20
2970836|2022-12-22 12:00:13|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_build:343.1||openqaworker19
2970654|2022-12-22 10:46:57|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_tests_and_build:4666ea5fd861b1096ce31d0ab27eb26405e52a79+343.1||openqaworker4
2970665|2022-12-22 10:30:35|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_build:343.1||openqaworker19
2970626|2022-12-22 09:20:01|done|failed|gnome:investigate:last_good_tests_and_build:8ddff33b87479593b1b940f7ec23f9bb9ea1e030+20221220||qa-power8-3
2970563|2022-12-22 09:01:55|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_build:343.1||openqaworker1
2970605|2022-12-22 08:39:19|done|failed|gnome:investigate:last_good_tests_and_build:8ddff33b87479593b1b940f7ec23f9bb9ea1e030+20221220||qa-power8-3
2970598|2022-12-22 08:34:36|done|failed|gnome:investigate:last_good_build:20221220||qa-power8-3
2970414|2022-12-22 07:55:11|done|failed|gnome:investigate:last_good_tests_and_build:8ddff33b87479593b1b940f7ec23f9bb9ea1e030+20221220||qa-power8-3
2970413|2022-12-22 07:52:17|done|failed|gnome:investigate:last_good_build:20221220||qa-power8-3
10229590|2022-12-22 13:15:48|done|failed|slem_containers_selinux||openqaworker-arm-3
10229587|2022-12-22 13:12:13|done|failed|slem_containers||openqaworker-arm-1
10229554|2022-12-22 12:49:19|done|failed|slem_containers_selinux||openqaworker-arm-2
10229551|2022-12-22 12:48:20|done|failed|slem_selinux||openqaworker-arm-3
10229570|2022-12-22 12:43:41|done|failed|slem_virtualization||openqaworker-arm-1
10229540|2022-12-22 12:36:43|done|failed|slem_containers||openqaworker-arm-1
10229539|2022-12-22 12:34:09|done|failed|slem_image_default||openqaworker-arm-2
10229538|2022-12-22 12:31:56|done|failed|slem_migration_5.1_to_5.2||openqaworker-arm-2
10229537|2022-12-22 12:30:49|done|failed|slem_virtualization:investigate:bisect_without_27245||openqaworker-arm-1
10229569|2022-12-22 11:53:01|done|failed|slem_installation_default||worker2

@Andrii Nikitin how do you interpret https://openqa.opensuse.org/tests/2970888#step/zypper_ref/73 and do you think it can be related to #120193, which was supposedly fixed by https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 ?
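
For triage, the pipe-separated listing above can be split into ":investigate:" jobs and regular jobs with a few lines of Python. This is a sketch only: the field names (job id, finish time, state, result, test, reason, worker) are guessed from the shape of the output, not taken from the openqa-query-for-job-label documentation, and `classify` is a made-up helper.

```python
from collections import Counter

# A few lines copied from the listing above.
SAMPLE = """\
2970888|2022-12-22 12:17:15|done|failed|upgrade_Leap_15.3_kde:investigate:last_good_tests_and_build:4666ea5fd861b1096ce31d0ab27eb26405e52a79+343.1||openqaworker20
2970598|2022-12-22 08:34:36|done|failed|gnome:investigate:last_good_build:20221220||qa-power8-3
10229590|2022-12-22 13:15:48|done|failed|slem_containers_selinux||openqaworker-arm-3
10229569|2022-12-22 11:53:01|done|failed|slem_installation_default||worker2
"""


def classify(line: str) -> str:
    """Return 'investigate' for follow-up investigation jobs, else 'regular'."""
    # Assumed column order; the sixth field is empty in every line shown above.
    job_id, finished, state, result, test, reason, worker = line.split("|")
    return "investigate" if ":investigate:" in test else "regular"


counts = Counter(classify(line) for line in SAMPLE.splitlines() if line)
print(dict(counts))
```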

Actions #12

Updated by okurz about 2 years ago

  • Subject changed from [tools] Test schedule ordering: tests run before repo available auto_review:"Test died: 'zypper -n ref' failed with code 4":retry size:M to [tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4" size:M
  • Due date deleted (2023-01-20)
  • Status changed from Feedback to Resolved

The first ones are all Leap investigation builds trying to access the former build of Leap, which may already have been removed by the asset cleanup. That would mean that the cleanup deleted intermediate and older repos despite https://openqa.opensuse.org/group_overview/50 being configured for 1.2TB of asset size. I have now bumped that a little higher, to 1.4TB. OSD tests like https://openqa.suse.de/tests/10229569#step/install_updates/90 fail due to a different problem with certificates which someone moved to a Confluence page (!), related thread https://suse.slack.com/archives/C029APBKLGK/p1671712571096889. This means we have not found any more recent problems for this exact ticket, so I am considering it resolved.

Actions #14

Updated by okurz about 2 years ago

  • Subject changed from [tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4" size:M to [tools] Test schedule ordering: tests run before repo available "Test died: 'zypper -n ref' failed with code 4"
  • Status changed from In Progress to New
  • Assignee deleted (okurz)

Ok, that looks like still the original problem of tests being triggered before the snapshot repos were synced, right?

Actions #15

Updated by dimstar about 2 years ago

okurz wrote:

Ok, that looks like still the original problem of tests being triggered before the snapshot repos were synced, right?

Indeed; retriggering the test later (after all jobs appeared) made them pass without problem

Actions #16

Updated by okurz about 2 years ago

  • Description updated (diff)
  • Assignee set to andriinikitin
  • Target version changed from Ready to future

@andriinikitin can you please

Actions #17

Updated by andriinikitin about 2 years ago

okurz wrote:

@andriinikitin can you please

I guess this discussion on slack got lost during holidays https://suse.slack.com/archives/C02CANHLANP/p1671715130965669
Summary: build 343.1 was 21 days old at the moment of failure and was probably cleaned up by the asset management job. Previous runs of that test with that build were successful, so the failure isn't related to the problem that https://github.com/os-autoinst/openqa-trigger-from-obs/pull/188 was addressing.

Actions #18

Updated by okurz about 2 years ago

  • Tags set to reactive work
Actions #19

Updated by andriinikitin about 2 years ago

Further improve consistency checks with https://github.com/os-autoinst/openqa-trigger-from-obs/pull/196 .
It is tricky to write a proper test for it, but it should help this time.

Actions #20

Updated by slo-gin almost 2 years ago

This ticket was set to High priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Actions #21

Updated by andriinikitin almost 2 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

I assume it is resolved with the last note about commit
