action #35914: Changes to Job::duplicate - openQA Project (public) - openSUSE Project Management Tool

action #35914

closed

Changes to Job::duplicate

Added by szarate about 7 years ago. Updated almost 7 years ago.

Status:

Resolved

Priority:

High

Assignee:

coolo

Category:

Feature requests

Target version:

Done

Start date:

2018-05-04

Due date:

% Done:

Estimated time:

Description

This is a long story biting us from time to time.

Apparently the changes in https://github.com/os-autoinst/openQA/pull/1623 made an underlying problem in our code a bit more obvious.

Now, according to mkravec:

There is a ~10% chance that a CaaSP cluster will have an incomplete job. If an incomplete happens, then:
- before this change it failed in an "organized" way - jobs started cloning and losing dependencies, so they never recovered and eventually it all died
- now it all goes "kaboom" and weird things happen
For example:
- https://openqa.suse.de/tests/1661011#settings - how does this job have 2 QAM-CaaSP-admin children?
- https://openqa.suse.de/tests/1661007 was cloned but lost 1 dependency in the process
- https://openqa.suse.de/tests/1652849 - how can this job find 8 new dependencies after being cloned? (it recovered & passed fine at the end)
- https://openqa.suse.de/tests/1652962 - how can an incomplete job have a higher ID than one that passed (1652905)? This causes the incomplete result to be displayed in the result overview
I reschedule the ISO when this happens.

A possible solution was discussed during a meeting:

When a Cluster Job is posted (via iso), create a clusterID
Add said clusterID to all of the jobs spawned that belong to the same cluster
When a user wants to restart one job from the cluster, look for all of the jobs with the same clusterID, and restart them (all or nothing)
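
The three steps above could be sketched roughly as follows. This is only an illustration in Python, not openQA code: `Job`, `post_cluster` and `restart_cluster` are hypothetical names, jobs live in a plain list instead of the database, and giving the clones a fresh cluster ID is an assumption the meeting notes do not cover.

```python
import itertools
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical in-memory model; none of these names come from openQA.
_job_ids = itertools.count(1)


@dataclass
class Job:
    name: str
    cluster_id: str
    parents: List[int] = field(default_factory=list)  # ids of jobs this job depends on
    id: int = field(default_factory=lambda: next(_job_ids))
    clone_id: Optional[int] = None  # set once the job has been restarted


def post_cluster(specs):
    """Create all jobs of one cluster and tag them with a shared cluster id.

    specs maps a job name to the names of the jobs it depends on.
    """
    cluster_id = uuid.uuid4().hex
    jobs = {name: Job(name, cluster_id) for name in specs}
    for name, parent_names in specs.items():
        jobs[name].parents = [jobs[p].id for p in parent_names]
    return list(jobs.values())


def restart_cluster(all_jobs, job_id):
    """Restart every job that shares the cluster id of job_id (all or nothing)."""
    target = next(j for j in all_jobs if j.id == job_id)
    members = [j for j in all_jobs
               if j.cluster_id == target.cluster_id and j.clone_id is None]
    # Assumption: the clones form a new cluster with its own id.
    new_cluster_id = uuid.uuid4().hex
    clones = {j.id: Job(j.name, new_cluster_id) for j in members}
    for old in members:
        clone = clones[old.id]
        # Dependencies are remapped so they point at the clones, never at old jobs.
        clone.parents = [clones[p].id for p in old.parents]
        old.clone_id = clone.id
    return list(clones.values())
```

Restarting any single member, for example `restart_cluster(jobs, jobs[1].id)`, then yields one clone per cluster member with the same dependency shape as the original cluster.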

In the meantime, the Job::duplicate function needs to stop being so smart, and it needs to be refactored.

AC1: Cluster jobs no longer display the behaviour described by mkravec or Ettore (i.e. jobs with missing or misconfigured dependencies)
AC1.1: ClusterID is introduced and openQA/Scheduler are using it automatically
AC1.2: When a single job from a cluster is restarted/cancelled, the whole cluster behaves as a Borg Collective and is restarted or cancelled as a unit.
AC1.3: Complexity of Job::duplicate is reduced
AC2: Job::duplicate function is refactored
AC3: Proper unit tests are written for cases with chained dependencies and for parallel jobs with multiple dependencies
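
As a rough idea of what the tests in AC3 might check, here is a sketch in Python. It does not use openQA's test suite or its real Job::duplicate: `clone_cluster` is a hypothetical stand-in for the code under test, and the dictionary layout for dependencies is an assumption made for this example. The checks mirror the symptoms reported above: no dependency lost, none gained, and every clone ID higher than every original ID.

```python
import unittest


def clone_cluster(deps, next_id):
    """Clone every job in deps, remapping dependency ids to the clones.

    deps maps job id -> {"chained": [parent ids], "parallel": [parent ids]}.
    Hypothetical stand-in for the code under test, not openQA's Job::duplicate.
    """
    id_map = {old: next_id + i for i, old in enumerate(sorted(deps))}
    cloned = {
        id_map[old]: {kind: [id_map[p] for p in parents]
                      for kind, parents in kinds.items()}
        for old, kinds in deps.items()
    }
    return id_map, cloned


class CloneClusterTest(unittest.TestCase):
    def assert_dependencies_preserved(self, deps, id_map, cloned):
        # Same number of jobs: nothing lost, nothing invented.
        self.assertEqual(len(cloned), len(deps))
        for old, kinds in deps.items():
            for kind, parents in kinds.items():
                new_parents = cloned[id_map[old]][kind]
                # Each clone depends on exactly the clones of its old parents.
                self.assertEqual(new_parents, [id_map[p] for p in parents])
                # No dependency may point outside the cloned cluster.
                self.assertTrue(set(new_parents) <= set(cloned))
        # Clones must get higher ids than the jobs they replace.
        self.assertGreater(min(cloned), max(deps))

    def test_chained(self):
        deps = {
            1: {"chained": [], "parallel": []},
            2: {"chained": [1], "parallel": []},
            3: {"chained": [2], "parallel": []},
        }
        id_map, cloned = clone_cluster(deps, next_id=100)
        self.assert_dependencies_preserved(deps, id_map, cloned)

    def test_parallel_with_multiple_dependencies(self):
        deps = {
            10: {"chained": [], "parallel": []},
            11: {"chained": [], "parallel": [10]},
            12: {"chained": [], "parallel": [10]},
            13: {"chained": [], "parallel": [10, 11, 12]},
        }
        id_map, cloned = clone_cluster(deps, next_id=100)
        self.assert_dependencies_preserved(deps, id_map, cloned)
```

Saved to a file, the cases can be run with `python -m unittest <file>`.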

Related issues 2 (0 open — 2 closed)

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by dasantiago about 7 years ago

Updated by EDiGiacinto about 7 years ago

Updated by dasantiago about 7 years ago

Updated by EDiGiacinto about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by szarate about 7 years ago

Updated by coolo about 7 years ago

Updated by coolo about 7 years ago

Updated by szarate almost 7 years ago