action #169342

closed

coordination #58184: [saga][epic][use case] full version control awareness within openQA

coordination #152847: [epic] version control awareness within openQA for test distributions

Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature size:M

Added by mkittler about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-11-05
Due date:
% Done: 0%
Estimated time:

Description

Observation

When enabling the git_auto_update = 'yes' feature in production, we noticed that parallel jobs with the PARALLEL_ONE_HOST_ONLY=1 setting were scheduled across different hosts, which should not have happened. All workers definitely had this setting and the dependencies in the cluster were definitely correct.

The theory is: openQA jobs are not considered at all by the scheduler as long as they are blocked by a Minion job. So the code evaluating PARALLEL_ONE_HOST_ONLY=1 might not see the full cluster and thus assign only a part of it. Later the cluster is going to be repaired, but at that point PARALLEL_ONE_HOST_ONLY=1 is no longer considered correctly. This theory implies that deleting rows for gru dependencies¹ is a non-atomic operation.

If the theory is correct, then the problematic part in the scheduler is this next unless … line:

    for my $job_id (keys %$cluster_jobs) {
        next unless my $cluster_job = $scheduled_jobs->{$job_id};
        $cluster_job->{one_host_only_via_worker} = 1;
    }

It could be used as a starting point for an implementation, though. Instead of skipping only the missing job, we could return from the entire function here with a negative result and skip the whole cluster for this scheduling tick.
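The idea of skipping the whole cluster could be sketched roughly as follows. This is only a minimal, standalone illustration and not the actual scheduler code: the function name `mark_cluster_one_host_only` and the plain hashes standing in for the cluster and the scheduled jobs are assumptions made for the example. Only `$cluster_jobs`, `$scheduled_jobs` and `one_host_only_via_worker` are taken from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real openQA function): marks all jobs of a
# cluster as "one host only", but only if every job of the cluster is visible
# to the scheduler. Returns 1 if the cluster was marked and 0 if the cluster
# should be skipped for this scheduling tick.
sub mark_cluster_one_host_only {
    my ($cluster_jobs, $scheduled_jobs) = @_;

    # first pass: return before modifying anything if a job is not visible,
    # e.g. because it is still blocked by a Minion job
    for my $job_id (keys %$cluster_jobs) {
        return 0 unless $scheduled_jobs->{$job_id};
    }

    # second pass: the cluster is complete, so mark all of its jobs
    for my $job_id (keys %$cluster_jobs) {
        $scheduled_jobs->{$job_id}->{one_host_only_via_worker} = 1;
    }
    return 1;
}

# usage with made-up job IDs: job 3 is missing from the second set of
# scheduled jobs, as if it were still blocked
my %cluster  = (1 => {}, 2 => {}, 3 => {});
my %complete = (1 => {id => 1}, 2 => {id => 2}, 3 => {id => 3});
my %partial  = (1 => {id => 1}, 2 => {id => 2});
print mark_cluster_one_host_only(\%cluster, \%complete) ? "scheduled\n" : "skipped\n";
print mark_cluster_one_host_only(\%cluster, \%partial)  ? "scheduled\n" : "skipped\n";
```

Checking in a separate pass before marking anything avoids leaving a cluster half-marked when a missing job is only detected midway through the loop.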

Acceptance criteria

  • AC1: The scheduler handles PARALLEL_ONE_HOST_ONLY=1 correctly if some jobs of the cluster are still blocked by Minion jobs or chained parents. It might simply detect the situation and not consider the cluster at all in the current tick.
  • AC2: Follow-up tickets are created for further issues we have found (non-transactional creation of Minion jobs when restarting jobs, repairing half-scheduled parallel clusters when PARALLEL_ONE_HOST_ONLY=1 is used)

Suggestions

  • Try to reproduce the issue first, e.g. modify the state of the database manually to have a cluster where some openQA jobs are blocked by Minion jobs and some are not. Maybe this can be done by extending an existing unit test.
  • Check whether deleting rows for gru dependencies¹ is actually a non-atomic operation at the default isolation level without an explicit transaction.
  • See https://github.com/os-autoinst/openQA/pull/6045 for further reference.

¹ referring to the deletion in lib/OpenQA/Shared/GruJob.pm:

    sub _delete_gru {
        my ($self, $id) = @_;
        my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
        $gru->delete() if $gru;
    }

    sub _fail_gru {
        my ($self, $id, $reason) = @_;
        my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
        $gru->fail($reason) if $gru;
    }

Related issues 4 (4 open, 0 closed)

Related to openQA Project (public) - action #169510: Improve non-transactional creation of Minion jobs for Git updates when restarting jobs size:M (Workable, 2024-11-07)

Related to openQA Project (public) - action #169513: Improve/investigate behavior when repairing half-scheduled parallel clusters when `PARALLEL_ONE_HOST_ONLY=1` is used (New, 2024-11-07)

Blocks openQA Project (public) - action #168379: Enable automatic openQA git clone by default size:S (Blocked, mkittler, 2024-10-17)

Blocks openQA Infrastructure (public) - action #168376: Enable automatic openQA git clone instead of fetchneedles on OSD size:S (Blocked, mkittler)