action #35914

closed

Changes to Job::duplicate

Added by szarate over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-05-04
Due date:
% Done: 0%
Estimated time:

Description

This is a long story biting us from time to time.

Apparently the changes in https://github.com/os-autoinst/openQA/pull/1623 made an underlying problem in our code a bit more obvious.

Now, according to mkravec:

There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete happens, then:
- before this change it failed in an "organized" way - jobs started cloning and losing dependencies, so they never recovered and eventually it all died
- now it all goes "kaboom" and weird things happen
For example:
- https://openqa.suse.de/tests/1661011#settings - how does this job have 2 QAM-CaaSP-admin children?
- https://openqa.suse.de/tests/1661007 was cloned but lost 1 dependency in the process
- https://openqa.suse.de/tests/1652849 - how can this job find 8 new dependencies after being cloned? (It recovered and passed fine at the end.)
- https://openqa.suse.de/tests/1652962 - how can an incomplete job have a higher ID than one that passed (1652905)? This causes the incomplete result to be displayed in the result overview.
I reschedule the ISO when this happens.

A possible solution was discussed during a meeting:

  • When a Cluster Job is posted (via iso), create a clusterID
  • Add said clusterID to all of the jobs spawned that belong to the same cluster
  • When a user wants to restart one job from the cluster, look for all of the jobs with the same clusterID, and restart them (all or nothing)
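The three steps above can be sketched roughly as follows. This is a toy in-memory model in Python, not openQA code: the `Job` and `Scheduler` classes, `post_iso`, `restart`, and the `clone_id` field are hypothetical names chosen for illustration. It also assumes that the clones receive a fresh clusterID of their own, which the proposal does not specify.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional

_job_ids = count(1)      # hypothetical job ID sequence
_cluster_ids = count(1)  # hypothetical clusterID sequence


@dataclass
class Job:
    name: str
    cluster_id: int
    id: int = field(default_factory=lambda: next(_job_ids))
    clone_id: Optional[int] = None  # set once the job has been restarted


class Scheduler:
    def __init__(self) -> None:
        self.jobs: Dict[int, Job] = {}

    def _create(self, names: List[str]) -> List[Job]:
        # One clusterID per batch, attached to every job in the batch.
        cluster_id = next(_cluster_ids)
        created = [Job(name=n, cluster_id=cluster_id) for n in names]
        for job in created:
            self.jobs[job.id] = job
        return created

    def post_iso(self, names: List[str]) -> List[Job]:
        """Posting a cluster creates a clusterID shared by all spawned jobs."""
        return self._create(names)

    def restart(self, job_id: int) -> List[Job]:
        """Restart every job sharing the clusterID of job_id (all or nothing)."""
        cluster_id = self.jobs[job_id].cluster_id
        members = [j for j in self.jobs.values() if j.cluster_id == cluster_id]
        if any(j.clone_id is not None for j in members):
            return []  # the cluster was already restarted; do nothing
        # Assumption: the clones form a new cluster with a fresh clusterID.
        clones = self._create([j.name for j in members])
        for old, new in zip(members, clones):
            old.clone_id = new.id
        return clones
```

Restarting any single member, for example `scheduler.restart(jobs[1].id)`, clones every job that shares its clusterID and leaves other clusters untouched.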

In the meantime, the Job::duplicate function needs to stop being so smart and needs to be refactored.

AC1: Cluster jobs no longer display the behaviour described by mkravec or Ettore (i.e. jobs with missing or misconfigured dependencies)
AC1.1: ClusterID is introduced and openQA/Scheduler use it automatically
AC1.2: When a single job from a cluster is restarted/cancelled, the whole cluster behaves as a Borg Collective and is restarted or cancelled as a unit.
AC1.3: Complexity of Job::duplicate is reduced
AC2: Job::duplicate function is refactored
AC3: Proper unit tests are written for chained dependencies and for parallel jobs with multiple dependencies
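
As an illustration of what AC3 asks for, the sketch below duplicates a whole cluster while preserving every dependency edge, then checks a chained case and a parallel case with multiple dependencies. It is written in Python against a toy model and is not the openQA implementation: `duplicate_cluster`, the `(parent, child, kind)` edge tuples, and the `next_id` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple

# (parent_id, child_id, kind), where kind is "chained" or "parallel"
Edge = Tuple[int, int, str]


def duplicate_cluster(
    job_ids: List[int], edges: List[Edge], next_id: int
) -> Tuple[Dict[int, int], List[Edge]]:
    """Clone all jobs of a cluster and remap every dependency to the clones.

    Raises ValueError if an edge points outside the cluster, so a clone can
    neither lose a dependency nor pick up a foreign one.
    """
    known = set(job_ids)
    for parent, child, _ in edges:
        if parent not in known or child not in known:
            raise ValueError(f"edge {parent}->{child} leaves the cluster")
    # Old job ID -> new job ID, assigned in ascending order of the old IDs.
    id_map = {old: next_id + i for i, old in enumerate(sorted(job_ids))}
    new_edges = [(id_map[p], id_map[c], kind) for p, c, kind in edges]
    return id_map, new_edges


def test_chained() -> None:
    # 1 -> 2 -> 3, each job chained after the previous one
    id_map, new_edges = duplicate_cluster(
        [1, 2, 3], [(1, 2, "chained"), (2, 3, "chained")], next_id=10
    )
    assert id_map == {1: 10, 2: 11, 3: 12}
    assert new_edges == [(10, 11, "chained"), (11, 12, "chained")]


def test_parallel_multiple_dependencies() -> None:
    # Job 3 depends on two parallel parents, 1 and 2
    _, new_edges = duplicate_cluster(
        [1, 2, 3], [(1, 3, "parallel"), (2, 3, "parallel")], next_id=20
    )
    assert new_edges == [(20, 22, "parallel"), (21, 22, "parallel")]
    assert len(new_edges) == 2  # no dependency lost, none invented


if __name__ == "__main__":
    test_chained()
    test_parallel_multiple_dependencies()
```

Treating the cluster as one unit keeps the duplication logic to a single ID remapping, which is one possible way to reduce the complexity mentioned in AC1.3.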


Related issues 2 (0 open, 2 closed)

Related to openQA Project (public) - action #34504: [tools][sporadic] Job's auto_duplicate fails to duplicate job dependencies (Resolved, dasantiago, 2018-04-09)
Precedes openQA Project (public) - coordination #32851: [tools][EPIC] Scheduling redesign (Resolved, okurz, 2018-05-05)
