action #35914

Updated by szarate about 4 years ago

This is a long-standing issue that bites us from time to time.

Apparently changes in, made an underlying problem in our code a bit more obvious.

Now, according to mkravec:

There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete happens, then:
- before this change it failed in an "organized" way: jobs started cloning and losing dependencies, so they never recovered and eventually it all died
- now it all goes "kaboom" and weird things happen
For example:
- - how does this job have 2 children QAM-CaaSP-admin
- was cloned but lost 1 dependency during that
- how can this job find 8 new dependencies after being cloned (it recovered & passed fine at the end)
- how can an incomplete job have a higher ID than one that passed (1652905), causing the incomplete result to be displayed in the result overview
I reschedule the ISO when this happens.

During a meeting, the following possible solution was discussed:

* When a Cluster Job is posted (via iso), create a clusterID
* Add said clusterID to all of the jobs spawned that belong to the same cluster
* When a user wants to restart one job from the cluster, look for all of the jobs with the same clusterID, and restart them (all or nothing)
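
To make the proposal above a bit more concrete, here is a minimal sketch in Python (not openQA code). The names `Job`, `Scheduler`, `post_cluster` and `restart` are made up for illustration, and it assumes that the restarted clones get a fresh clusterID and that the old jobs are marked "obsolete"; neither detail was decided in the meeting.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical ID generators; a real scheduler would use database sequences.
_job_ids = itertools.count(1)
_cluster_ids = itertools.count(1)


@dataclass
class Job:
    name: str
    cluster_id: int
    id: int = field(default_factory=lambda: next(_job_ids))
    state: str = "scheduled"


class Scheduler:
    def __init__(self):
        self.jobs = {}  # job id -> Job

    def post_cluster(self, names):
        """Create one clusterID and stamp it on every job spawned by the post."""
        cluster_id = next(_cluster_ids)
        created = [Job(name, cluster_id) for name in names]
        for job in created:
            self.jobs[job.id] = job
        return created

    def restart(self, job_id):
        """Restart the whole cluster that job_id belongs to (all or nothing)."""
        old_cluster = self.jobs[job_id].cluster_id
        members = [j for j in self.jobs.values() if j.cluster_id == old_cluster]
        # Assumption: clones share a new clusterID, so a lookup by clusterID
        # only ever returns jobs of one generation.
        new_cluster = next(_cluster_ids)
        clones = [Job(j.name, new_cluster) for j in members]
        for job in members:
            job.state = "obsolete"
        for clone in clones:
            self.jobs[clone.id] = clone
        return clones
```

Restarting any single member, e.g. `scheduler.restart(some_job.id)`, clones every job carrying the same clusterID, so no job can be left behind with a stale or partial set of siblings.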

In the meantime, the Job::duplicate function needs to stop being so smart, and it needs to be refactored.

AC1: Cluster jobs no longer display the behaviour described by mkravec or Ettore (i.e. jobs with missing or misconfigured dependencies)
AC1.1: ClusterID is introduced and openQA/Scheduler are using it automatically
AC1.2: When a single job from a cluster is restarted/cancelled, the whole cluster behaves as a Borg Collective and restarts or gets cancelled.
AC1.3: Complexity of Job::duplicate is reduced
AC2: Job::duplicate function is refactored
AC3: Proper unit tests are written for chained cases, and for parallel cases with multiple dependencies
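
As a rough idea of what the checks for AC3 could assert, here is a sketch in Python rather than in the openQA test suite. `duplicate_cluster` is a hypothetical stand-in for a simplified duplicate that clones all jobs at once and remaps every dependency edge. It assumes dependencies are stored as (child, parent, kind) tuples and that every edge stays inside the cluster.

```python
def duplicate_cluster(job_ids, deps, first_new_id):
    """Clone every job and remap each dependency edge onto the clones.

    job_ids: IDs of all jobs in the cluster
    deps: set of (child_id, parent_id, kind) edges within the cluster
    first_new_id: first free job ID (hypothetical; normally a DB sequence)
    """
    mapping = {old: first_new_id + offset for offset, old in enumerate(job_ids)}
    # A KeyError here means an edge points outside the cluster.
    new_deps = {(mapping[child], mapping[parent], kind)
                for child, parent, kind in deps}
    return mapping, new_deps


def check_chained():
    # Job 2 is chained after job 1, job 3 is chained after job 2.
    deps = {(2, 1, "chained"), (3, 2, "chained")}
    mapping, new_deps = duplicate_cluster([1, 2, 3], deps, first_new_id=10)
    assert new_deps == {(11, 10, "chained"), (12, 11, "chained")}


def check_parallel_with_multiple_dependencies():
    # Job 4 runs in parallel with three parents.
    deps = {(4, 1, "parallel"), (4, 2, "parallel"), (4, 3, "parallel")}
    mapping, new_deps = duplicate_cluster([1, 2, 3, 4], deps, first_new_id=10)
    # No dependency is lost or invented while cloning.
    assert len(new_deps) == len(deps)
    parents = {p for child, p, kind in new_deps if child == mapping[4]}
    assert parents == {mapping[1], mapping[2], mapping[3]}
    # Every clone has a higher ID than every original job.
    assert min(mapping.values()) > max(mapping)
```

The three assertions in the parallel case correspond to the symptoms reported above: lost dependencies, dependencies appearing out of nowhere, and an older job ending up with a higher ID than its replacement.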