action #34504

closed

[tools][sporadic] Job's auto_duplicate fails to duplicate job dependencies

Added by dasantiago almost 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-04-09
Due date:
% Done: 20%
Estimated time:
Description

I deleted poo#32854 by mistake instead of my comment.

This happens sporadically and needs investigation. Most of the time we observe it as 'flaky' Travis tests that fail on it [1].

Apparently auto_duplicate fails to copy part of the job dependencies. It doesn't happen only in CI tests; we already have other reports mentioning this issue.

AC:

  • Identify and fix the issue
  • Stress tests inside our unit-test suite that show that auto_duplicate is idempotent

1: Test failure points to https://github.com/os-autoinst/openQA/blob/dda08666d6f21473a990e9afcbda6be8a8280b2c/t/05-scheduler-dependencies.t#L785 auto_duplicate()


Files

scheduler-dependencies-failure.patch (1.11 KB), szarate, 2018-04-12 08:02
error.log (29.1 KB), szarate, 2018-04-12 10:00
passing.log (28.4 KB): Passing test, szarate, 2018-04-12 10:05
expected_error.log (28.7 KB): Expected error, can be compared with passing.log, szarate, 2018-04-12 11:24

Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #35914: Changes to Job::duplicate (Resolved, coolo, 2018-05-04)

Actions #1

Updated by dasantiago almost 7 years ago

This can be reproduced by the test:

time while perl t/05-scheduler-dependencies.t; do echo next; done
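The shell loop above stops at the first failing run. To estimate how often the test fails instead, a small wrapper along these lines could be used. This is only a sketch: the count_failures helper and the script name are made up for illustration and are not part of openQA.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command $runs times and
# return how many of those runs exited non-zero.
sub count_failures {
    my ($runs, @cmd) = @_;
    my $failures = 0;
    for (1 .. $runs) {
        # system() returns 0 only when the command ran and exited successfully.
        $failures++ if system(@cmd) != 0;
    }
    return $failures;
}

# Only run when a run count and a command are given on the command line.
if (@ARGV >= 2) {
    my ($runs, @cmd) = @ARGV;
    my $failures = count_failures($runs, @cmd);
    print "$failures of $runs runs failed\n";
}
```

Saved as, say, count_failures.pl (name is an assumption), it could be called as: perl count_failures.pl 30 perl t/05-scheduler-dependencies.t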

Actions #2

Updated by dasantiago almost 7 years ago

There are two problems with this test:

1- This query https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L735 doesn't enforce any ordering, and is_deeply from Test::More requires the whole data structure to match exactly, including the order of array elements. There is no room for flexibility. Although the following lists of chained dependencies are equivalent, the test will fail:

[$jobB2_h->{id}, $jobC2_h->{id}, $jobD2_h->{id}] contains:

[ 100032,
100030,
100031
]

$jobA2_h->{children}->{Chained} contains:

[ 100030,
100031,
100032
]

The dependencies are the same, but it will fail.
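To illustrate point 1, here is a minimal sketch of an order-insensitive comparison using only Test::More. The IDs are the ones from the example above; the variable names are made up for illustration and this is not necessarily how the test was fixed in the end.

```perl
use strict;
use warnings;
use Test::More;

# Job IDs from the example above (variable names are illustrative only).
my @expected = (100032, 100030, 100031);    # order in which the test built its list
my @got      = (100030, 100031, 100032);    # order in which the query returned the rows

# Sort both sides numerically so the comparison no longer depends on row order.
my @sorted_expected = sort { $a <=> $b } @expected;
my @sorted_got      = sort { $a <=> $b } @got;

is_deeply(\@sorted_got, \@sorted_expected, 'same chained children regardless of order');

done_testing();
```

Comparing the unsorted lists with is_deeply would fail here, because the first elements already differ. If an extra dependency is acceptable, cmp_deeply with bag() from Test::Deep is another way to compare lists without regard to order.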

2- Some dependencies are not returned.
Still debugging

Actions #3

Updated by dasantiago almost 7 years ago

I'm still analyzing point 2 mentioned above. From an analysis of the SQL queries I can confirm that, in the cases where it fails, some job inserts into the DB are not being done, which causes the tests to fail.

Actions #4

Updated by szarate almost 7 years ago

After running while prove --verbose --color t/05-scheduler-dependencies.t; do echo "Not failed"; done a test other than auto_duplicate() fails too, which means that sorting just hides part of the problem.

# [
#   100030,
#   100031,
#   100032
# ]
# [
#   100030,
#   100031,
#   100032
# ]
ok 176 - jobA2 has jobB2, jobC2 and jobD2 as children
[debug] new job 100042
ok 177 - job cloned
ok 178 - job has jobA2 as parent
ok 179 - job cloned
ok 180 - job has jobA2 as parent
ok 181 - job cloned
ok 182 - job has jobA2 as parent
ok 183 - jobA2 is indeed jobA clone
[debug] new job 100043
ok 184 - job correctly not cloned
ok 185 - job has jobA3 as parent
ok 186 - job correctly not cloned
ok 187 - job has jobA3 as parent
ok 188 - job correctly not cloned
ok 189 - job has jobA3 as parent
[debug] new job 100051
ok 190 - job cloned
not ok 191 - job has jobA2 as parent
#   Failed test 'job has jobA2 as parent'
#   at t/05-scheduler-dependencies.t line 793.
#     Structures begin differing at:
#          $got->[0] = Does not exist
#     $expected->[0] = '100051'

Actions #5

Updated by dasantiago almost 7 years ago

szarate wrote:

After running while prove --verbose --color t/05-scheduler-dependencies.t; do echo "Not failed"; done a test (different than auto_duplicate()) fails too. Which means that sorting just hides part of the problem.


It doesn't hide anything. It's a different problem, the one I mentioned in point 2.

The sort just fixes the test. The problem in test 191 is that the cloned job isn't being created. This is the real problem.

Actions #6

Updated by szarate almost 7 years ago

Run after $schema->storage->debug(1);

Actions #7

Updated by szarate almost 7 years ago

Actions #9

Updated by szarate almost 7 years ago

Saving comments from Mudler from poo#32858:

As promised, I'm moving this back to Ready because it is taking more than 1 day.
I didn't have much luck with it. It is hard to reproduce locally (only 1 out of ~30 runs of the scheduler_dependencies test fails for me), so it's expensive in terms of time, and it is difficult to identify the real issue (and to be sure it's fixed, as we already have a lot of false positives).
At least I can rule out my hunch from the previous comment: when the failure does happen, that point is not reached, so it must be something else.
I also tried wrapping everything in a transaction (since it searches and clones recursively, multiple invocations could create race conditions), but that made the tests fail even more badly, and that road certainly takes more than one day. I'm afraid we will have to refactor this if it becomes even more problematic.

Actions #10

Updated by dasantiago almost 7 years ago

  • % Done changed from 0 to 80

Issue found and fixed, but it broke some other tests. I'm fixing the remaining tests.

Actions #11

Updated by dasantiago almost 7 years ago

Only UI tests (not related) are failing in travis

Actions #12

Updated by dasantiago almost 7 years ago

  • % Done changed from 80 to 100
Actions #13

Updated by dasantiago over 6 years ago

  • % Done changed from 100 to 90

Need to improve the tests before it can be closed.

Actions #14

Updated by dasantiago over 6 years ago

  • % Done changed from 90 to 100

The changes were merged yesterday.

More tests have been implemented and the PR is already done.

Actions #15

Updated by dasantiago over 6 years ago

  • Status changed from In Progress to Resolved
Actions #17

Updated by dasantiago over 6 years ago

EDiGiacinto wrote:

Reopening since now duplication is not working properly

https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134

You have to be more specific about what's wrong with the duplication.

Is it the skipped job in https://openqa.suse.de/tests/1644104#settings ?
And how about the other jobs? Is this problem happening only on caasp jobs?

Actions #18

Updated by EDiGiacinto over 6 years ago

dasantiago wrote:

EDiGiacinto wrote:

Reopening since now duplication is not working properly

https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134

You have to be more specific in what's wrong with the duplication.

Adding in CC those who might explain it better; you have the QA engineers responsible for those tests in your room ;)

Basically, there are jobs that are now completely decoupled from the cluster when you hit the restart button.

Is the skipped job in https://openqa.suse.de/tests/1644104#settings ?

It's not a matter of the job results; it's that the jobs are no longer tied together in the same cluster after restarting them. This becomes more important once automatic restarts come into play.

And how about the other jobs? It's only on caasp jobs that this problem is happening?

No, also ses, and potentially all MM jobs.

Actions #19

Updated by mkravec over 6 years ago

There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete happens, then:

  • before this change it failed in an "organized" way: jobs started cloning and losing dependencies, so they never recovered and eventually it all died
  • now it all goes "kaboom" and weird things happen

For example:

I reschedule ISO when this happens.

Actions #20

Updated by dasantiago over 6 years ago

My change only adds a dependency that was missing, so it shouldn't cause anything to be lost.

For example:
https://openqa.suse.de/tests/1661011#settings

This was probably introduced with my change, but losing deps? :-(

Is there any multi-machine test environment that I can use for my needs? Or is there any way to simulate a multi-machine environment?

Actions #21

Updated by dasantiago over 6 years ago

  • % Done changed from 100 to 20
Actions #22

Updated by EDiGiacinto over 6 years ago

dasantiago wrote:

My changes only adds a dependency that was missing, so because of that change it shouldn't lose anything.

Yes and no, as it's not only adding; before, it was just skipping. If you look closely at https://github.com/os-autoinst/openQA/pull/1623/files#diff-85ae48e70a5c110c9e439c3a5ea28d5fR759 you can also skip other child deps (see next()) and ignore the other conditions further down in the loop, which could call duplicate (recursively, again), and I suspect that this can also lead to losing the duplicated jobs' dependencies. Even if unpleasant, duplicate() looks buggy; to fix this properly we would need to rewrite it from scratch.

For example:
https://openqa.suse.de/tests/1661011#settings

This probably was introduced with my change, but the losing deps? :-(

Is there any multimachine test environment that i can use for my needs? Or is there any way to simulate a multi machine environment?

Actions #23

Updated by szarate over 6 years ago

  • Status changed from In Progress to Resolved

Setting to Resolved; work will continue in poo#35914, as the initial problem here was solved.

Actions #24

Updated by szarate over 6 years ago

Actions #25

Updated by szarate over 6 years ago

  • Target version changed from Current Sprint to Done