action #35914

Updated by szarate over 6 years ago

This is a long-standing problem that bites us from time to time.

 Apparently the changes in https://github.com/os-autoinst/openQA/pull/1623 made an underlying problem in our code a bit more obvious.

 Now, according to mkravec: 

     There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete happens, then: 
     - before this change it failed in an "organized" way - jobs started cloning and losing dependencies, so they never recovered and eventually it all died 
     - now it all goes "kaboom" and weird things happen 
     For example: 
     - https://openqa.suse.de/tests/1661011#settings - how does this job have 2 children QAM-CaaSP-admin 
     - https://openqa.suse.de/tests/1661007 was cloned but lost 1 dependency during that 
     - https://openqa.suse.de/tests/1652849 how can this job find 8 new dependencies after being cloned (it recovered & passed fine at the end) 
     - https://openqa.suse.de/tests/1652962 how can an incomplete job have a higher ID than one that passed (1652905) - causing the incomplete result to be displayed in the result overview 
     I reschedule ISO when this happens. 


 It was discussed during a meeting that a possible solution would be: 

 * When a Cluster Job is posted (via iso), create a clusterID 
 * Add said clusterID to all of the jobs spawned that belong to the same cluster 
 * When a user wants to restart one job from the cluster, look for all of the jobs with the same clusterID, and restart them (all or nothing) 
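 The three steps above can be sketched roughly as follows. This is only an illustration in Python, not openQA's actual (Perl) code: the `Job` class, the in-memory `jobs_db` dict, and the helpers `post_cluster` and `restart_cluster` are all hypothetical names. It also assumes that the clones receive a fresh clusterID, which the proposal does not specify.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical id generators standing in for database sequences.
_next_job_id = itertools.count(1)
_next_cluster_id = itertools.count(1)


@dataclass
class Job:
    name: str
    cluster_id: int
    id: int = field(default_factory=lambda: next(_next_job_id))
    clone_of: Optional[int] = None
    parents: List[int] = field(default_factory=list)  # ids of jobs this one depends on


def post_cluster(jobs_db: Dict[int, Job],
                 spec: Sequence[Tuple[str, Sequence[str]]]) -> int:
    """Create all jobs of one cluster and tag them with a new cluster id.

    `spec` lists (job name, parent job names); parents must appear first.
    """
    cluster_id = next(_next_cluster_id)
    by_name: Dict[str, Job] = {}
    for name, parent_names in spec:
        job = Job(name=name, cluster_id=cluster_id,
                  parents=[by_name[p].id for p in parent_names])
        by_name[name] = job
        jobs_db[job.id] = job
    return cluster_id


def restart_cluster(jobs_db: Dict[int, Job], job_id: int) -> int:
    """Restart every job that shares job_id's cluster id (all or nothing)."""
    old_cluster_id = jobs_db[job_id].cluster_id
    members = [j for j in jobs_db.values() if j.cluster_id == old_cluster_id]
    new_cluster_id = next(_next_cluster_id)  # assumption: clones form a new cluster

    # Pass 1: clone every member so each old id maps to exactly one new id.
    old_to_new: Dict[int, int] = {}
    pairs = []
    for old in members:
        clone = Job(name=old.name, cluster_id=new_cluster_id, clone_of=old.id)
        old_to_new[old.id] = clone.id
        pairs.append((old, clone))

    # Pass 2: rewire dependencies so clones only point at other clones.
    for old, clone in pairs:
        clone.parents = [old_to_new[p] for p in old.parents]
        jobs_db[clone.id] = clone
    return new_cluster_id
```

 Cloning in two passes means no clone can lose a dependency or pick up one from outside its cluster, because every parent id is translated through the same old-to-new map before any clone is stored.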


 In the meantime, the Job::duplicate function needs to stop being so smart and needs to be refactored. 

 AC1: Cluster jobs no longer display the behaviour described by mkravec or Ettore (i.e. jobs with missing or misconfigured dependencies) 
 AC1.1: ClusterID is introduced and openQA/Scheduler use it automatically 
 AC1.2: When a single job from a cluster is restarted or cancelled, the whole cluster behaves as a Borg Collective and is restarted or cancelled with it. 
 AC1.3: Complexity of Job::duplicate is reduced 
 AC2: Job::duplicate function is refactored 
 AC3: Proper unit tests are written for chained jobs and for parallel jobs with multiple dependencies
