action #136154
closed
coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
multimachine tests restarted by RETRY test variable end up without the proper dependency size:M
Added by szarate about 1 year ago.
Updated 12 months ago.
Category: Regressions/Crashes
Description
Observation
I started noticing multiple multi-machine (MM) jobs that are missing one or more dependencies:
Normally this is an MM job in a cluster of two jobs: https://openqa.suse.de/tests/12210430 should look like https://openqa.suse.de/tests/12207579#dependencies
In this case RETRY=1 makes the situation worse and causes blocked updates, because jobs get restarted automatically that should never have been, see https://openqa.suse.de/tests/12207609
Suggestions
- Find a reproducing scenario with multi-machine clusters using RETRY=1
- Create a simple MM cluster locally (maybe within unit tests or by adjusting the local database manually) and invoke the code that runs on an automatic retry (via RETRY=…), e.g. in t/10-jobs.t where we already use RETRY, and take a look into t/05-scheduler-dependencies.t
- Only then solve this problem in a mob session since only Marius is currently aware of how to do it
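When working on a reproducer, the property to check can be written down independently of openQA. The following is a minimal sketch, not openQA code: the function names, the job IDs and the `clone_map` structure are all hypothetical, and it assumes that a restarted job should keep every dependency of its original, pointing at the clone of the parent if the parent was restarted too and at the original parent otherwise.

```python
# Hypothetical helper for a reproducer; none of these names exist in openQA.
# Edges are (child_job_id, parent_job_id, dependency_type) tuples.

def expected_edges(original_edges, clone_map):
    """Return the edges the restarted jobs should have.

    clone_map maps original job IDs to the IDs of their clones.
    Assumption: a parent that was not restarted keeps its original ID.
    """
    expected = set()
    for child, parent, dep in original_edges:
        if child in clone_map:
            expected.add((clone_map[child], clone_map.get(parent, parent), dep))
    return expected


def missing_edges(original_edges, clone_map, actual_edges):
    """Return the expected edges that are absent from the new cluster."""
    return expected_edges(original_edges, clone_map) - set(actual_edges)


# Made-up example: 100 and 101 form a parallel cluster, 100 has a chained parent 90.
original = [(101, 100, 2), (100, 90, 1)]
clones = {100: 200, 101: 201}

# The restart kept the chained edge but lost the parallel one.
after_restart = [(200, 90, 1)]

lost = missing_edges(original, clones, after_restart)
print(lost)
```

An empty result would mean the restarted cluster has all the dependencies of the original one, in which case the problem has to be looked for elsewhere, for example in how the graph is rendered.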
- Copied from action #80264: multimachine tests unable to get vars from its pair job added
- Parent task set to #102906
Setting a better looking parent task
Hm, this has the same subject as the ticket clone source. I assume you forgot to update that?
- Priority changed from Urgent to High
- Target version changed from Ready to Tools - Next
- Status changed from New to Rejected
no response. I assume the ticket was created by mistake.
- Subject changed from multimachine tests unable to get vars from its pair job to multimachine tests restarted end up without the proper dependency
- Status changed from Rejected to Workable
no response. I assume the ticket was created by mistake.
Nope, it's more that I read the email but never replied :). I fixed the title; the ticket description has enough info.
- Assignee deleted (okurz)
- Priority changed from High to Normal
- Start date deleted (2020-11-24)
Ok. We can plan to look into this ticket, but if you can, please add a link to "latest" and more information, because otherwise the jobs you reference might be lost soon.
- Status changed from Workable to New
This wasn't estimated, hence setting back to New
- Target version changed from Tools - Next to Ready
- Subject changed from multimachine tests restarted end up without the proper dependency to multimachine tests restarted end up without the proper dependency size:M
- Description updated (diff)
- Status changed from New to Workable
- Subject changed from multimachine tests restarted end up without the proper dependency size:M to multimachine tests restarted by RETRY test variable end up without the proper dependency size:M
- Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
A more recent example: https://openqa.suse.de/tests/13093155#dependencies
Scenario: https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=qam_kernel_multipath&version=15-SP4
This is not reproducible by just cloning the MM cluster and invoking e.g. script/openqa eval 'app->schema->resultset("Jobs")->find(4198)->done(result => "failed")'. The restart happens, but the dependency graphs look good (on both the original cluster and the new cluster).
From the job logs it also becomes clear that another job actually did run in parallel. The database also shows the parallel dependency:
openqa=> select * from job_dependencies where parent_job_id = 13093155 or child_job_id = 13093155;
 child_job_id | parent_job_id | dependency
--------------+---------------+------------
     13093155 |      13093031 |          2
     13093155 |      13092988 |          1
(2 rows)
So this is most likely just a display issue in the dependency graph.
EDIT: When the chained parent is cloned as well, the display issue becomes reproducible.
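To read the numeric dependency column in the query output above, here is a small sketch that groups the parents of a job by dependency type. The mapping 1 = chained, 2 = parallel, 3 = directly chained is an assumption based on the constants in OpenQA::JobDependencies::Constants and should be verified against your openQA checkout; the function name is made up for illustration.

```python
from collections import defaultdict

# Assumed meaning of the numeric `dependency` column; verify against
# OpenQA::JobDependencies::Constants before relying on it.
DEPENDENCY_NAMES = {1: "chained", 2: "parallel", 3: "directly chained"}

# Rows as (child_job_id, parent_job_id, dependency), copied from the psql output.
ROWS = [
    (13093155, 13093031, 2),
    (13093155, 13092988, 1),
]


def parents_by_type(rows, job_id):
    """Group the parents of job_id by the name of the dependency type."""
    grouped = defaultdict(list)
    for child, parent, dep in rows:
        if child == job_id:
            name = DEPENDENCY_NAMES.get(dep, f"unknown({dep})")
            grouped[name].append(parent)
    return dict(grouped)


summary = parents_by_type(ROWS, 13093155)
print(summary)
```

Under that assumed mapping, job 13093155 has one parallel parent and one chained parent in the database, which matches the conclusion that the data is intact and only the graph rendering is off.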
- Status changed from Workable to In Progress
- Due date set to 2024-01-02
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
- Due date deleted (2024-01-02)
- Status changed from Feedback to Resolved