action #34504

[tools][sporadic] Job's auto_duplicate fails to duplicate job dependencies

Added by dasantiago over 1 year ago. Updated over 1 year ago.

Status: Resolved
Start date: 09/04/2018
Priority: Normal
Due date:
Assignee: dasantiago
% Done: 20%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

I deleted poo#32854 by mistake instead of my comment.

This happens sporadically and needs investigation. Most of the time we observe it as 'flaky' Travis tests that fail on it [1].

Apparently auto_duplicate fails to copy part of the job dependencies. It doesn't happen only in CI tests; we already have other reports mentioning this issue.

AC:
- Identify and fix the issue
- Stress tests inside our unit-test suite that show that auto_duplicate is idempotent

1: Test failure points to https://github.com/os-autoinst/openQA/blob/dda08666d6f21473a990e9afcbda6be8a8280b2c/t/05-scheduler-dependencies.t#L785 auto_duplicate()

scheduler-dependencies-failure.patch (1.11 KB) szarate, 12/04/2018 08:02 am

error.log (29.1 KB) szarate, 12/04/2018 10:00 am

passing.log - Passing test (28.4 KB) szarate, 12/04/2018 10:05 am

expected_error.log - Expected error can be compared with passing.log (28.7 KB) szarate, 12/04/2018 11:24 am


Related issues

Related to openQA Project - action #35914: Changes to Job::duplicate Resolved 04/05/2018

History

#1 Updated by dasantiago over 1 year ago

This can be reproduced by the test:

time while perl t/05-scheduler-dependencies.t; do echo next; done

#2 Updated by dasantiago over 1 year ago

There are two problems with this test:

1- This query https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L735 doesn't enforce any ordering, and is_deeply from Test::More requires the whole data structure to match exactly, including the order of array elements; there is no room for flexibility. Although the following lists of chained dependencies are equivalent, the test will fail:

[$jobB2_h->{id}, $jobC2_h->{id}, $jobD2_h->{id}] contains:

[100032, 100030, 100031]

$jobA2_h->{children}->{Chained} contains:

[100030, 100031, 100032]

The dependencies are the same, but it will fail.
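A minimal sketch of an order-insensitive comparison, assuming the job IDs are plain integers. The helper name sorted_ids and the two arrays are hypothetical stand-ins for the values above, not code from the openQA test suite:

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-ins for the two lists shown above.
my @expected = (100032, 100030, 100031);    # IDs in the order the test builds them
my @got      = (100030, 100031, 100032);    # IDs as returned by the unordered query

# Return a numerically sorted copy so that ordering no longer matters.
sub sorted_ids {
    my ($ids) = @_;
    return [sort { $a <=> $b } @$ids];
}

# is_deeply compares arrays element by element, so both sides are sorted first.
is_deeply(sorted_ids(\@got), sorted_ids(\@expected), 'same children regardless of order');

done_testing;
```

Sorting both sides only makes the assertion independent of query order; it does not help when a dependency is missing altogether.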

2- Some dependencies are not returned.
Still debugging

#3 Updated by dasantiago over 1 year ago

I'm still analyzing point 2 mentioned above. From the SQL queries I can confirm that, in the cases where it fails, some job inserts into the DB aren't being done, which causes the tests to fail.

#4 Updated by szarate over 1 year ago

After running while prove --verbose --color t/05-scheduler-dependencies.t; do echo "Not failed"; done, a test other than auto_duplicate() fails too, which means that sorting just hides part of the problem.

# [
#   100030,
#   100031,
#   100032
# ]
# [
#   100030,
#   100031,
#   100032
# ]
ok 176 - jobA2 has jobB2, jobC2 and jobD2 as children
[debug] new job 100042
ok 177 - job cloned
ok 178 - job has jobA2 as parent
ok 179 - job cloned
ok 180 - job has jobA2 as parent
ok 181 - job cloned
ok 182 - job has jobA2 as parent
ok 183 - jobA2 is indeed jobA clone
[debug] new job 100043
ok 184 - job correctly not cloned
ok 185 - job has jobA3 as parent
ok 186 - job correctly not cloned
ok 187 - job has jobA3 as parent
ok 188 - job correctly not cloned
ok 189 - job has jobA3 as parent
[debug] new job 100051
ok 190 - job cloned
not ok 191 - job has jobA2 as parent
#   Failed test 'job has jobA2 as parent'
#   at t/05-scheduler-dependencies.t line 793.
#     Structures begin differing at:
#          $got->[0] = Does not exist
#     $expected->[0] = '100051'

#5 Updated by dasantiago over 1 year ago

szarate wrote:

After running while prove --verbose --color t/05-scheduler-dependencies.t; do echo "Not failed"; done a test (different than auto_duplicate()) fails too. Which means that sorting just hides part of the problem.


# [
# 100030,
# 100031,
# 100032
# ]
# [
# 100030,
# 100031,
# 100032
# ]
ok 176 - jobA2 has jobB2, jobC2 and jobD2 as children
[debug] new job 100042
ok 177 - job cloned
ok 178 - job has jobA2 as parent
ok 179 - job cloned
ok 180 - job has jobA2 as parent
ok 181 - job cloned
ok 182 - job has jobA2 as parent
ok 183 - jobA2 is indeed jobA clone
[debug] new job 100043
ok 184 - job correctly not cloned
ok 185 - job has jobA3 as parent
ok 186 - job correctly not cloned
ok 187 - job has jobA3 as parent
ok 188 - job correctly not cloned
ok 189 - job has jobA3 as parent
[debug] new job 100051
ok 190 - job cloned
not ok 191 - job has jobA2 as parent
# Failed test 'job has jobA2 as parent'
# at t/05-scheduler-dependencies.t line 793.
# Structures begin differing at:
# $got->[0] = Does not exist
# $expected->[0] = '100051'

It doesn't hide anything. It's a different problem, the one I mentioned in point 2.

The sort just fixes the test. The problem in test 191 is that the cloned job isn't being created. This is the real problem.

#6 Updated by szarate over 1 year ago

Run after $schema->storage->debug(1);

#7 Updated by szarate over 1 year ago

#8 Updated by szarate over 1 year ago

#9 Updated by szarate over 1 year ago

Saving comments from Mudler from poo#32858:

As promised, moving back into Ready since it took more than 1 day.
Didn't have much luck with that: it is hard to reproduce locally (just 1 out of ~30 runs of the scheduler_dependencies test fails for me), so it's expensive in terms of time, and it is difficult to identify the real issue as well (and to be sure it's fixed, as we already have a lot of false positives).
At least I can rule out my hunch from the previous comment: when the failure happens, that code is not reached, so it must be something else.
Tried wrapping everything in a transaction as well (since it is searching and cloning recursively, multiple invocations could create race conditions), but that made the tests fail more horribly, and that road will surely take more than one day. I'm afraid we will have to refactor this if it starts to become even more problematic.

#10 Updated by dasantiago over 1 year ago

  • % Done changed from 0 to 80

Issue found and fixed, but it broke some other tests. I'm fixing the remaining tests.

#11 Updated by dasantiago over 1 year ago

Only UI tests (not related) are failing in travis

#12 Updated by dasantiago over 1 year ago

  • % Done changed from 80 to 100

#13 Updated by dasantiago over 1 year ago

  • % Done changed from 100 to 90

Need to improve the tests before it can be closed.

#14 Updated by dasantiago over 1 year ago

  • % Done changed from 90 to 100

The changes were merged yesterday.

More tests are implemented and the PR is already done.

#15 Updated by dasantiago over 1 year ago

  • Status changed from In Progress to Resolved

#17 Updated by dasantiago over 1 year ago

EDiGiacinto wrote:

Reopening since now duplication is not working properly


https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134

You have to be more specific about what's wrong with the duplication.

Is the skipped job in https://openqa.suse.de/tests/1644104#settings ?
And how about the other jobs? Is it only on caasp jobs that this problem is happening?

#18 Updated by EDiGiacinto over 1 year ago

dasantiago wrote:

EDiGiacinto wrote:

Reopening since now duplication is not working properly


https://openqa.suse.de/tests/1659294#settings
https://openqa.suse.de/tests/1652849
https://openqa.suse.de/tests/1644104#settings
https://openqa.suse.de/tests/overview?distri=caasp&version=3.0&build=0073&groupid=134


You have to be more specific in what's wrong with the duplication.

Adding people in CC who might explain it better; you have QA Engineers responsible for those tests in your room ;)

Basically, there are jobs that are now completely decoupled from the cluster when you hit the restart button.

Is the skipped job in https://openqa.suse.de/tests/1644104#settings ?

It's not a matter of the job results; it's that they are no longer tied to the same cluster after restarting them. This is more important when automatic restarts come into play.

And how about the other jobs? It's only on caasp jobs that this problem is happening?

No, also ses, and potentially all MM jobs.

#19 Updated by mkravec over 1 year ago

There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete job happens, then:
- before this change it failed in an "organized" way: jobs started cloning and losing dependencies, so they never recovered and eventually it all died
- now it all goes "kaboom" and weird things happen

For example:
- https://openqa.suse.de/tests/1661011#settings - how does this job have 2 children QAM-CaaSP-admin
- https://openqa.suse.de/tests/1661007 was cloned but lost 1 dependency during that
- https://openqa.suse.de/tests/1652849 how can this job find 8 new dependencies after being cloned (it recovered & passed fine at the end)
- https://openqa.suse.de/tests/1652962 how can incomplete job have higher ID than one that passed (1652905) - causing incomplete result being displayed in result overview

I reschedule the ISO when this happens.

#20 Updated by dasantiago over 1 year ago

My change only adds a dependency that was missing, so it shouldn't cause anything to be lost.

For example:
https://openqa.suse.de/tests/1661011#settings

This probably was introduced with my change, but the losing deps? :-(

Is there any multimachine test environment that i can use for my needs? Or is there any way to simulate a multi machine environment?

#21 Updated by dasantiago over 1 year ago

  • % Done changed from 100 to 20

#22 Updated by EDiGiacinto over 1 year ago

dasantiago wrote:

My changes only adds a dependency that was missing, so because of that change it shouldn't lose anything.

Yes and no, as it's not only adding: before, it was just skipping. If you look closely at https://github.com/os-autoinst/openQA/pull/1623/files#diff-85ae48e70a5c110c9e439c3a5ea28d5fR759 you can also skip other child deps (see next()) and ignore the other conditions further down in the loop that could call duplicate (recursively, again), and I suspect that this can also lead to losing the dependencies of duplicated jobs. Even if unpleasant, duplicate() looks buggy; to fix this properly we would need to rewrite it from scratch.
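The concern about next() can be illustrated with a toy loop. This is only a sketch: the data, the field names (kind, state) and the function ids_to_clone are invented for illustration and do not reflect the actual Job::duplicate code:

```perl
use strict;
use warnings;

# Toy child records; fields and values are hypothetical.
my @children = (
    {id => 1, kind => 'chained',  state => 'done'},
    {id => 2, kind => 'parallel', state => 'done'},
    {id => 3, kind => 'parallel', state => 'running'},
);

# Returns the IDs of children that would be cloned.
# With $early_skip set, the `next` at the top of the loop bypasses every
# later condition for that child, including the check that would clone it.
sub ids_to_clone {
    my ($early_skip, @kids) = @_;
    my @cloned;
    for my $kid (@kids) {
        next if $early_skip && $kid->{state} eq 'done';
        push @cloned, $kid->{id} if $kid->{kind} eq 'parallel';
    }
    return @cloned;
}

print join(',', ids_to_clone(0, @children)), "\n";    # prints 2,3
print join(',', ids_to_clone(1, @children)), "\n";    # prints 3
```

In this sketch the early next silently drops child 2, even though a later condition in the same loop body would have cloned it.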

For example:
https://openqa.suse.de/tests/1661011#settings


This probably was introduced with my change, but the losing deps? :-(


Is there any multimachine test environment that i can use for my needs? Or is there any way to simulate a multi machine environment?

#23 Updated by szarate over 1 year ago

  • Status changed from In Progress to Resolved

Setting to resolved, work will continue in: poo#35914 as the initial problem here was solved.

#24 Updated by szarate over 1 year ago

#25 Updated by szarate over 1 year ago

  • Target version changed from Current Sprint to Done
