action #136154

closed

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

multimachine tests restarted by RETRY test variable end up without the proper dependency size:M

Added by szarate over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

I started noticing multiple multi-machine (MM) jobs that are missing one or more dependencies:

Normally this job is an MM one with two jobs: https://openqa.suse.de/tests/12210430 -> its dependency graph should look like https://openqa.suse.de/tests/12207579#dependencies

In this case RETRY=1 makes the situation worse: it causes blocked updates due to jobs that should never have been restarted automatically, see https://openqa.suse.de/tests/12207609

Suggestions

  • Find a reproducing scenario with multi-machine clusters using RETRY=1
  • Create a simple MM cluster locally (maybe within unit tests or by adjusting the local database manually) and invoke the code that runs on an automatic retry (via RETRY=…), e.g. in t/10-jobs.t where we already use RETRY, and take a look into t/05-scheduler-dependencies.t
  • Only then solve this problem in a mob session since only Marius is currently aware of how to do it
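
The property such a local reproducer would check can be sketched as a toy model. This is not openQA code: the `restart` and `parallel_cluster` helpers, the job ids and the tuple layout `(child, parent, kind)` are all hypothetical, and the numeric codes for chained and parallel dependencies are an assumption taken from the database output quoted later in this ticket. The sketch only illustrates the expectation that a restarted job keeps the same kinds of parents as the original.

```python
from itertools import count

# Assumed numeric codes for the dependency kinds (not confirmed by this ticket).
CHAINED, PARALLEL = 1, 2


def parallel_cluster(job_id, deps):
    """Return all jobs connected to job_id through parallel edges."""
    cluster, todo = {job_id}, [job_id]
    while todo:
        current = todo.pop()
        for child, parent, kind in deps:
            if kind != PARALLEL or current not in (child, parent):
                continue
            other = parent if current == child else child
            if other not in cluster:
                cluster.add(other)
                todo.append(other)
    return cluster


def restart(job_id, deps, new_ids):
    """Toy restart: clone the whole parallel cluster and re-create its edges."""
    cluster = parallel_cluster(job_id, deps)
    clones = {old: next(new_ids) for old in sorted(cluster)}
    new_deps = []
    for child, parent, kind in deps:
        if child in clones:
            # Parents outside the cluster (e.g. a chained parent) are kept as-is.
            new_deps.append((clones[child], clones.get(parent, parent), kind))
    return clones, new_deps


def parent_kinds(job_id, deps):
    """Sorted list of dependency kinds that job_id has towards its parents."""
    return sorted(kind for child, _parent, kind in deps if child == job_id)


# Hypothetical cluster: job 1 is a chained parent, jobs 2 and 3 run in parallel.
deps = [(3, 1, CHAINED), (3, 2, PARALLEL)]
clones, new_deps = restart(3, deps, count(100))

# The restarted job must have the same kinds of parents as the original.
assert parent_kinds(clones[3], new_deps) == parent_kinds(3, deps)
```

A real test would replace the toy `restart` with the code path that RETRY=… triggers and compare the dependency rows of the original and the restarted cluster in the same way.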

Related issues 2 (0 open, 2 closed)

Related to openQA Project (public) - action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M (Resolved, mkittler, 2023-12-11)

Copied from openQA Project (public) - action #80264: multimachine tests unable to get vars from its pair job (Resolved, mkittler, 2020-11-24)

Actions #1

Updated by szarate over 1 year ago

  • Copied from action #80264: multimachine tests unable to get vars from its pair job added
Actions #2

Updated by szarate over 1 year ago

  • Parent task set to #102906

Setting a better looking parent task

Actions #3

Updated by okurz over 1 year ago

  • Assignee set to okurz

Hm, this has the same subject as the ticket clone source. I assume you forgot to update that?

Actions #4

Updated by okurz over 1 year ago

  • Priority changed from Urgent to High
  • Target version changed from Ready to Tools - Next
Actions #5

Updated by okurz over 1 year ago

  • Status changed from New to Rejected

no response. I assume the ticket was created by mistake.

Actions #6

Updated by szarate over 1 year ago

  • Subject changed from multimachine tests unable to get vars from its pair job to multimachine tests restarted end up without the proper dependency
  • Status changed from Rejected to Workable

> no response. I assume the ticket was created by mistake.

Nope, more like I read the email but never replied :). I fixed the title; the ticket description has enough info.

Actions #7

Updated by okurz over 1 year ago

  • Assignee deleted (okurz)
  • Priority changed from High to Normal
  • Start date deleted (2020-11-24)

Ok. We can plan to look into this ticket, but if you can, please add a link to "latest" and more information, because otherwise the jobs you reference might be lost soon.

Actions #8

Updated by livdywan over 1 year ago

  • Status changed from Workable to New

This wasn't estimated, hence setting back to New

Actions #9

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready
Actions #10

Updated by tinita about 1 year ago

  • Subject changed from multimachine tests restarted end up without the proper dependency to multimachine tests restarted end up without the proper dependency size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by okurz about 1 year ago

  • Subject changed from multimachine tests restarted end up without the proper dependency size:M to multimachine tests restarted by RETRY test variable end up without the proper dependency size:M
Actions #12

Updated by okurz about 1 year ago

  • Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Actions #13

Updated by mkittler about 1 year ago

  • Assignee set to mkittler
Actions #14

Updated by mkittler about 1 year ago · Edited

A more recent example: https://openqa.suse.de/tests/13093155#dependencies

Scenario: https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam_kernel_multipath&version=15-SP4

This is not reproducible by just cloning the MM cluster and invoking e.g. script/openqa eval 'app->schema->resultset("Jobs")->find(4198)->done(result => "failed")'. The restart happens but the dependency graphs look good (on the original cluster and the new cluster).

From the job logs it also becomes clear that another job actually did run in parallel. The database also shows the parallel dependency:

openqa=> select * from job_dependencies where parent_job_id = 13093155 or child_job_id = 13093155;
 child_job_id | parent_job_id | dependency 
--------------+---------------+------------
     13093155 |      13093031 |          2
     13093155 |      13092988 |          1
(2 rows)

So this is most likely just a display issue in the dependency graph.
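
To make the query result easier to read, the two rows can be decoded into named dependency types. The following is only a sketch: the `Dependency` enum and the `describe` helper are hypothetical names, and the mapping 1 = chained, 2 = parallel, 3 = directly chained is an assumption that should be checked against openQA's own constants. The rows themselves are copied from the query output above.

```python
from enum import IntEnum


class Dependency(IntEnum):
    # Assumed meaning of the numeric "dependency" column.
    CHAINED = 1
    PARALLEL = 2
    DIRECTLY_CHAINED = 3


# (child_job_id, parent_job_id, dependency) rows from the query above.
rows = [
    (13093155, 13093031, 2),
    (13093155, 13092988, 1),
]


def describe(rows, job_id):
    """Group the parents of job_id by the name of their dependency type."""
    parents = {}
    for child, parent, dep in rows:
        if child == job_id:
            parents.setdefault(Dependency(dep).name, []).append(parent)
    return parents


summary = describe(rows, 13093155)
print(summary)
```

Under that assumed mapping, job 13093155 has one parallel parent (13093031) and one chained parent (13092988), which matches the statement that the parallel dependency is present in the database.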

EDIT: When the chained parent is cloned as well, the display issue becomes reproducible.

Actions #15

Updated by mkittler about 1 year ago

  • Status changed from Workable to In Progress
Actions #16

Updated by openqa_review about 1 year ago

  • Due date set to 2024-01-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback
Actions #18

Updated by okurz about 1 year ago · Edited

  • Due date deleted (2024-01-02)
  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/openQA/pull/5400 merged. I trust the unit tests and screenshots enough. At least something like https://openqa.opensuse.org/tests/3825747#dependencies still looks fine after https://openqa.opensuse.org/changelog shows the change already deployed.

@szarate we verified that the original problem was only impacting visualization so the actual test results were never affected in a negative way.
