action #136154
closed coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
multimachine tests restarted by RETRY test variable end up without the proper dependency size:M
Description
Observation
I started noticing multiple multi-machine (MM) jobs that are missing one or more dependencies:
Normally this job is an MM one with two jobs: https://openqa.suse.de/tests/12210430 should look like https://openqa.suse.de/tests/12207579#dependencies
In this case, RETRY=1 makes the situation worse, causing blocked updates due to jobs that should never have been restarted automatically, see https://openqa.suse.de/tests/12207609
Suggestions
- Find a reproducing scenario with multi-machine clusters using RETRY=1
- Create a simple MM cluster locally (maybe within unit tests or by adjusting the local database manually) and invoke the code that runs on an automatic retry (via RETRY=…), e.g. in t/10-jobs.t where we already use RETRY, and take a look into t/05-scheduler-dependencies.t
- Only then solve this problem in a mob session since only Marius is currently aware of how to do it
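To make concrete what such a reproducer would need to assert, here is a minimal sketch in Python. It is a simplified model, not openQA code: the `Dependency` class, both helper functions, and the job IDs are hypothetical, and the numeric dependency codes are assumed to mirror the values stored in the `job_dependencies` table.

```python
from dataclasses import dataclass

# Assumption: these mirror the numeric codes in openQA's job_dependencies table.
CHAINED, PARALLEL = 1, 2


@dataclass(frozen=True)
class Dependency:
    child: int
    parent: int
    kind: int


def expected_after_restart(original, id_map):
    """Translate the original edges to the IDs of the restarted jobs.

    id_map maps old job IDs to their clones; jobs that were not restarted
    (e.g. a chained parent that passed) keep their old ID.
    """
    return {
        Dependency(id_map.get(d.child, d.child), id_map.get(d.parent, d.parent), d.kind)
        for d in original
        if d.child in id_map or d.parent in id_map
    }


def missing_dependencies(original, restarted, id_map):
    """Return the edges the restarted cluster should have but does not."""
    return expected_after_restart(original, id_map) - set(restarted)


# Hypothetical cluster: job 3 has a chained parent (1) and a parallel parent (2).
original = [Dependency(3, 1, CHAINED), Dependency(3, 2, PARALLEL)]
id_map = {2: 12, 3: 13}  # only the parallel cluster was restarted
restarted = [Dependency(13, 1, CHAINED)]  # the parallel edge got lost

lost = missing_dependencies(original, restarted, id_map)
print(lost)
```

An empty result would mean the restarted cluster kept every dependency; in the example the parallel edge between the two clones is reported as missing.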
Updated by szarate over 1 year ago
- Copied from action #80264: multimachine tests unable to get vars from its pair job added
Updated by okurz over 1 year ago
- Assignee set to okurz
Hm, this has the same subject as the ticket clone source. I assume you forgot to update that?
Updated by okurz over 1 year ago
- Priority changed from Urgent to High
- Target version changed from Ready to Tools - Next
Updated by okurz over 1 year ago
- Status changed from New to Rejected
no response. I assume the ticket was created by mistake.
Updated by szarate over 1 year ago
- Subject changed from multimachine tests unable to get vars from its pair job to multimachine tests restarted end up without the proper dependency
- Status changed from Rejected to Workable
no response. I assume the ticket was created by mistake.
Nope, more like I read the email but never replied :). I fixed the title; the ticket description has enough info.
Updated by okurz over 1 year ago
- Assignee deleted (okurz)
- Priority changed from High to Normal
- Start date deleted (2020-11-24)
Ok. We can plan to look into this ticket, but if you can, please add a link to "latest" and more information, because otherwise the jobs you reference might be lost soon.
Updated by livdywan over 1 year ago
- Status changed from Workable to New
This wasn't estimated, hence setting back to New
Updated by okurz about 1 year ago
- Target version changed from Tools - Next to Ready
Updated by tinita about 1 year ago
- Subject changed from multimachine tests restarted end up without the proper dependency to multimachine tests restarted end up without the proper dependency size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 1 year ago
- Subject changed from multimachine tests restarted end up without the proper dependency size:M to multimachine tests restarted by RETRY test variable end up without the proper dependency size:M
Updated by okurz about 1 year ago
- Related to action #152389: significant increase in MM-test failure ratio 2023-12-11: test fails in multipath_iscsi and other multi-machine scenarios due to MTU size auto_review:"ping with packet size 1350 failed, problems with MTU" size:M added
Updated by mkittler about 1 year ago · Edited
A more recent example: https://openqa.suse.de/tests/13093155#dependencies
This is not reproducible by just cloning the MM cluster and invoking e.g. script/openqa eval 'app->schema->resultset("Jobs")->find(4198)->done(result => "failed")'. The restart happens but the dependency graphs look good (on the original cluster and the new cluster).
From the job logs it also becomes clear that another job actually did run in parallel. The database also shows the parallel dependency:
openqa=> select * from job_dependencies where parent_job_id = 13093155 or child_job_id = 13093155;
child_job_id | parent_job_id | dependency
--------------+---------------+------------
13093155 | 13093031 | 2
13093155 | 13092988 | 1
(2 rows)
So this is most likely just a display issue of the dependency graph.
EDIT: When also cloning the chained parent, the display issue becomes reproducible.
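For readers unfamiliar with the numeric `dependency` column, the following Python sketch decodes the two rows returned by the query above. The code-to-name mapping is an assumption based on openQA's dependency constants (1 = chained, 2 = parallel, 3 = directly chained), and `parents_by_type` is a hypothetical helper, not part of openQA.

```python
from collections import defaultdict

# Assumption: numeric codes as used in the job_dependencies.dependency column.
DEPENDENCY_NAMES = {1: "chained", 2: "parallel", 3: "directly chained"}


def parents_by_type(job_id, rows):
    """Group the parents of job_id by dependency type.

    rows are (child_job_id, parent_job_id, dependency) tuples, in the same
    column order as the query output.
    """
    grouped = defaultdict(list)
    for child_id, parent_id, dependency in rows:
        if child_id == job_id:
            name = DEPENDENCY_NAMES.get(dependency, f"unknown ({dependency})")
            grouped[name].append(parent_id)
    return dict(grouped)


# The two rows from the query output.
rows = [(13093155, 13093031, 2), (13093155, 13092988, 1)]
parents = parents_by_type(13093155, rows)
print(parents)
```

Under that mapping, job 13093155 has one parallel parent and one chained parent in the database, which matches the conclusion that only the graph rendering was affected.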
Updated by mkittler about 1 year ago
- Status changed from Workable to In Progress
Updated by openqa_review about 1 year ago
- Due date set to 2024-01-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 1 year ago
- Status changed from In Progress to Feedback
Updated by okurz about 1 year ago · Edited
- Due date deleted (2024-01-02)
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/openQA/pull/5400 merged. I trust the unit tests and screenshots enough. At least something like https://openqa.opensuse.org/tests/3825747#dependencies still looks fine after https://openqa.opensuse.org/changelog shows the change already deployed.
@szarate we verified that the original problem was only impacting visualization so the actual test results were never affected in a negative way.