action #34504
closed[tools][sporadic] Job's auto_duplicate fails to duplicate job dependencies
20%
Description
I deleted poo#32854 by mistake instead of my comment.
This happens sporadically and needs investigation. Most of the time we observe it as 'flaky' Travis tests that fail on it [1].
Apparently auto_duplicate fails to copy part of the job dependencies. It doesn't happen only in CI tests; we already have other reports mentioning this issue.
AC:
- Identify and fix the issue
- Stress tests inside our unit-test suite that show that auto_duplicate is idempotent
[1]: Test failure points to https://github.com/os-autoinst/openQA/blob/dda08666d6f21473a990e9afcbda6be8a8280b2c/t/05-scheduler-dependencies.t#L785 auto_duplicate()
Files
Updated by dasantiago almost 7 years ago
This can be reproduced by the test:
time while perl t/05-scheduler-dependencies.t; do echo next; done
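The loop above stops at the first failure. To estimate how often the test fails, a bounded variant that counts failures may be more convenient. This is only a sketch: count_failures and the script name are hypothetical and not part of openQA.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of openQA): run a command $runs times
# and count how many runs exit with a non-zero status.
sub count_failures {
    my ($runs, @cmd) = @_;
    my $failures = 0;
    for my $i (1 .. $runs) {
        # system() returns the wait status; 0 means the command succeeded.
        $failures++ if system(@cmd) != 0;
    }
    return $failures;
}

# Only runs when a run count and a command are given on the command line.
if (@ARGV >= 2) {
    my ($runs, @cmd) = @ARGV;
    my $failures = count_failures($runs, @cmd);
    print "$failures of $runs runs failed\n";
}
```

Assuming the script is saved as count_failures.pl, it could be called as: perl count_failures.pl 30 perl t/05-scheduler-dependencies.t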
Updated by dasantiago almost 7 years ago
There are two problems with this test:
1- This query https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L735 doesn't enforce any ordering, and is_deeply from Test::More requires the whole data structure to match exactly, including the order of array elements. There is no room for flexibility. Although the following chained dependencies are equivalent, the test will fail:
[$jobB2_h->{id}, $jobC2_h->{id}, $jobD2_h->{id}] contains:
[ 100032,
100030,
100031
]
$jobA2_h->{children}->{Chained} contains:
[ 100030,
100031,
100032
]
The dependencies are the same, but it will fail.
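A minimal sketch of an order-insensitive comparison, assuming the IDs are plain integers as in the output above. The helper sorted_ids and the variable names are hypothetical; this is not necessarily how the test was eventually changed.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper: return a numerically sorted copy of a list of job IDs.
sub sorted_ids {
    my ($ids) = @_;
    return [sort { $a <=> $b } @$ids];
}

# Sample IDs taken from the output above; the two lists differ only in order.
my $clone_ids = [100032, 100030, 100031];
my $children  = [100030, 100031, 100032];

# A direct is_deeply($children, $clone_ids, ...) would fail here, because
# arrays are compared element by element, in order. Sorting both sides
# first makes the comparison independent of the query's row order.
is_deeply(sorted_ids($children), sorted_ids($clone_ids), 'same children regardless of order');

done_testing;
```

An alternative would be to add an explicit order to the query itself, which keeps the test unchanged but also fixes the order for every other caller.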
2- Some dependencies are not returned.
Still debugging
Updated by dasantiago almost 7 years ago
I'm still analyzing point 2 mentioned above. From the SQL queries I can confirm that, in the cases where it fails, some job inserts into the DB aren't being executed, which causes the tests to fail.
Updated by szarate almost 7 years ago
After running while prove --verbose --color t/05-scheduler-dependencies.t; do echo "Not failed"; done
a test (other than the auto_duplicate() one) fails too, which means that sorting just hides part of the problem.
# [
# 100030,
# 100031,
# 100032
# ]
# [
# 100030,
# 100031,
# 100032
# ]
ok 176 - jobA2 has jobB2, jobC2 and jobD2 as children
[debug] new job 100042
ok 177 - job cloned
ok 178 - job has jobA2 as parent
ok 179 - job cloned
ok 180 - job has jobA2 as parent
ok 181 - job cloned
ok 182 - job has jobA2 as parent
ok 183 - jobA2 is indeed jobA clone
[debug] new job 100043
ok 184 - job correctly not cloned
ok 185 - job has jobA3 as parent
ok 186 - job correctly not cloned
ok 187 - job has jobA3 as parent
ok 188 - job correctly not cloned
ok 189 - job has jobA3 as parent
[debug] new job 100051
ok 190 - job cloned
not ok 191 - job has jobA2 as parent
# Failed test 'job has jobA2 as parent'
# at t/05-scheduler-dependencies.t line 793.
# Structures begin differing at:
# $got->[0] = Does not exist
# $expected->[0] = '100051'
Updated by dasantiago almost 7 years ago
szarate wrote:
After running while prove --verbose --color t/05-scheduler-dependencies.t; do echo "Not failed"; done
a test (different than auto_duplicate()) fails too. Which means that sorting just hides part of the problem.
It doesn't hide it. It's a different problem, the one I mentioned in point 2.
The sort just fixes the test. The problem in test 191 is that the cloned job isn't being created. This is the real problem.
Updated by szarate almost 7 years ago
- File expected_error.log added
Updated by szarate almost 7 years ago
Saving comments from Mudler from poo#32858:
As promised, moving this back to Ready as it is taking more than 1 day.
Didn't have much luck with that. It is hard to reproduce locally (just 1 out of ~30 runs of the scheduler_dependencies test fails for me), so it's expensive in terms of time, and it is also difficult to identify the real issue (and to be sure it's fixed, as we already have a lot of false positives).
At least I can rule out my hunch from the previous comment: when the failure does happen, it is not reached, so it must be something else.
I also tried wrapping everything in a transaction (since it is searching and cloning recursively, multiple invocations could create race conditions), but that made the tests fail even more horribly, and that road will surely take more than one day. I'm afraid we will have to refactor this if it starts to become even more problematic.
Updated by dasantiago almost 7 years ago
- % Done changed from 0 to 80
Issue found and fixed, but it broke some other tests. I'm fixing the remaining tests.
Updated by dasantiago almost 7 years ago
Only UI tests (not related) are failing in travis
Updated by dasantiago over 6 years ago
- % Done changed from 100 to 90
Need to improve the tests before it can be closed.
Updated by dasantiago over 6 years ago
- % Done changed from 90 to 100
The changes were merged yesterday.
More tests have been implemented and the PR is already done.
Updated by dasantiago over 6 years ago
- Status changed from In Progress to Resolved
Updated by EDiGiacinto over 6 years ago
- Status changed from Resolved to In Progress
Reopening since now duplication is not working properly
https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134
Updated by dasantiago over 6 years ago
EDiGiacinto wrote:
Reopening since now duplication is not working properly
https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134
You have to be more specific about what's wrong with the duplication.
Is the skipped job in https://openqa.suse.de/tests/1644104#settings ?
And how about the other jobs? Is this problem happening only on caasp jobs?
Updated by EDiGiacinto over 6 years ago
dasantiago wrote:
EDiGiacinto wrote:
Reopening since now duplication is not working properly
https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134
You have to be more specific in what's wrong with the duplication.
Adding in CC those who might explain it better; you have QA Engineers responsible for those tests in your room ;)
Basically, there are jobs that are now completely decoupled from the cluster when you hit the restart button.
Is the skipped job in https://openqa.suse.de/tests/1644104#settings ?
It's not a matter of the job results; it's that the jobs are no longer tied together in the same cluster after restarting them. This becomes more important once automatic restarts are in place.
And how about the other jobs? It's only on caasp jobs that this problem is happening?
No, also ses, and potentially all MM jobs.
Updated by mkravec over 6 years ago
There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete happens, then:
- before this change it failed in an "organized" way - jobs started cloning and losing dependencies, so they never recovered and eventually it all died
- now it all goes "kaboom" and weird things happen
For example:
- https://openqa.suse.de/tests/1661011#settings - how does this job have 2 children QAM-CaaSP-admin
- https://openqa.suse.de/tests/1661007 was cloned but lost 1 dependency during that
- https://openqa.suse.de/tests/1652849 how can this job find 8 new dependencies after being cloned (it recovered & passed fine at the end)
- https://openqa.suse.de/tests/1652962 how can incomplete job have higher ID than one that passed (1652905) - causing incomplete result being displayed in result overview
I reschedule the ISO when this happens.
Updated by dasantiago over 6 years ago
My change only adds a dependency that was missing, so it shouldn't cause anything to be lost.
For example:
https://openqa.suse.de/tests/1661011#settings
This probably was introduced with my change, but the losing deps? :-(
Is there any multi-machine test environment that I can use for my needs? Or is there any way to simulate a multi-machine environment?
Updated by EDiGiacinto over 6 years ago
dasantiago wrote:
My changes only adds a dependency that was missing, so because of that change it shouldn't lose anything.
Yes and no, as it's not only adding; before, it was just skipping. If you look closely at https://github.com/os-autoinst/openQA/pull/1623/files#diff-85ae48e70a5c110c9e439c3a5ea28d5fR759 you can also skip other child deps (see next()) and ignore the other conditions further down while cycling, which could call duplicate (recursively, again), and I suspect that can also lead to losing duplicated jobs' dependencies. Even if unpleasant, duplicate() looks buggy; to fix this properly we would need to rewrite it from scratch.
For example:
https://openqa.suse.de/tests/1661011#settings
This probably was introduced with my change, but the losing deps? :-(
Is there any multimachine test environment that i can use for my needs? Or is there any way to simulate a multi machine environment?
Updated by szarate over 6 years ago
- Status changed from In Progress to Resolved
Setting to resolved, work will continue in: poo#35914 as the initial problem here was solved.
Updated by szarate over 6 years ago
- Related to action #35914: Changes to Job::duplicate added
Updated by szarate over 6 years ago
- Target version changed from Current Sprint to Done