action #169342: Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature size:M - openQA Project (public) - openSUSE Project Management Tool

closed

coordination #58184: [saga][epic][use case] full version control awareness within openQA

coordination #152847: [epic] version control awareness within openQA for test distributions

Added by mkittler 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Feature requests
Target version: Ready
Start date: 2024-11-05

Description

Observation

When enabling the git_auto_update = 'yes' feature in production, we noticed that parallel jobs with the PARALLEL_ONE_HOST_ONLY=1 setting were scheduled across different hosts, which should not have happened. All workers definitely had this setting and the dependencies in the cluster were definitely correct.

The theory is: openQA jobs are not considered at all by the scheduler as long as they are blocked by a Minion job. So the code evaluating PARALLEL_ONE_HOST_ONLY=1 might not see the full cluster and thus assign only a part of it. Later the cluster is going to be repaired, but at that point PARALLEL_ONE_HOST_ONLY=1 is no longer considered correctly. This theory implies that deleting rows for gru dependencies¹ is a non-atomic operation.

If the theory is correct, then the problematic part in the scheduler is this `next unless …` line:

    for my $job_id (keys %$cluster_jobs) {
        next unless my $cluster_job = $scheduled_jobs->{$job_id};
        $cluster_job->{one_host_only_via_worker} = 1;
    }

It could be used as a starting point for an implementation, though. We could return from the entire function here with a negative result and skip the whole cluster for this scheduling tick.
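The idea of skipping the whole cluster could be sketched roughly as follows. This is only an illustration on plain hashes, not the actual scheduler code: the function name `flag_cluster_one_host_only` and its return convention are assumptions made for this example, and only `$cluster_jobs`, `$scheduled_jobs` and `one_host_only_via_worker` are taken from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper (name and return values are assumptions for illustration).
# $cluster_jobs:   hash ref keyed by the IDs of all jobs in the parallel cluster
# $scheduled_jobs: hash ref keyed by the IDs of jobs the scheduler currently sees
# Returns 1 if every cluster job was flagged, 0 if the cluster should be skipped.
sub flag_cluster_one_host_only {
    my ($cluster_jobs, $scheduled_jobs) = @_;

    # first pass: return before modifying anything if any cluster job is not
    # visible to the scheduler (e.g. because it is still blocked by a Minion job)
    for my $job_id (keys %$cluster_jobs) {
        return 0 unless $scheduled_jobs->{$job_id};
    }

    # second pass: the cluster is complete, so flag every job
    $scheduled_jobs->{$_}->{one_host_only_via_worker} = 1 for keys %$cluster_jobs;
    return 1;
}

# usage: job 3 is part of the cluster but missing from the scheduled jobs
my %cluster   = (1 => {}, 2 => {}, 3 => {});
my %scheduled = (1 => {}, 2 => {});
my $complete  = flag_cluster_one_host_only(\%cluster, \%scheduled);
print $complete ? "cluster flagged\n" : "cluster skipped for this tick\n";
```

Checking completeness in a separate pass before setting any flag means an incomplete cluster is left untouched, so the caller can simply try again in the next scheduling tick.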

Acceptance criteria

AC1: The scheduler handles PARALLEL_ONE_HOST_ONLY=1 correctly if some jobs of the cluster are still blocked by Minion jobs or chained parents. It might simply detect the situation and not consider the cluster at all in the current tick.
AC2: Follow-up tickets are created for further issues we have found (non-transactional creation of Minion jobs when restarting jobs, repairing half-scheduled parallel clusters when PARALLEL_ONE_HOST_ONLY=1 is used)

Suggestions

Try to reproduce the issue first, e.g. modify the state of the database manually to have a cluster where some openQA jobs are blocked by Minion jobs and some are not. Maybe this can be done by extending an existing unit test.
Check whether deleting rows for gru dependencies¹ is actually a non-atomic operation at the default isolation level without an explicit transaction.
See https://github.com/os-autoinst/openQA/pull/6045 for further reference.

¹ referring to the deletion in lib/OpenQA/Shared/GruJob.pm:

    sub _delete_gru {
        my ($self, $id) = @_;
        my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
        $gru->delete() if $gru;
    }

    sub _fail_gru {
        my ($self, $id, $reason) = @_;
        my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
        $gru->fail($reason) if $gru;
    }

Related issues 4 (1 open — 3 closed)
