action #169342

Updated by mkittler about 1 month ago

## Observation 
 When enabling the `git_auto_update = 'yes'` feature in production, we noticed that parallel jobs with the `PARALLEL_ONE_HOST_ONLY=1` setting were scheduled across different hosts, which should not have happened. All workers definitely had this setting and the dependencies in the cluster were definitely correct. 

 The theory is: openQA jobs are not considered at all by the scheduler as long as they are blocked by a Minion job. So the code evaluating `PARALLEL_ONE_HOST_ONLY=1` might not see the full cluster and thus assign only part of it. The cluster is repaired later, but by then `PARALLEL_ONE_HOST_ONLY=1` is no longer taken into account correctly. If this theory holds, deleting rows for gru dependencies¹ must be a non-atomic operation. 

 If the theory is correct, then the problematic part in the scheduler is this `next unless …` line: 

 ``` 
     for my $job_id (keys %$cluster_jobs) { 
         next unless my $cluster_job = $scheduled_jobs->{$job_id}; 
         $cluster_job->{one_host_only_via_worker} = 1; 
     } 
 ``` 

 This line could also serve as a starting point for a fix, though. We could return from the entire function here with a negative result and skip the whole cluster for this scheduling tick. 
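The early return described above could look roughly like the following. This is a minimal, standalone sketch and not the actual openQA scheduler code: the helper name `flag_cluster_one_host_only` and the sample data are hypothetical, and only the `$cluster_jobs`/`$scheduled_jobs` hash shapes and the `one_host_only_via_worker` flag are taken from the snippet above. It checks the whole cluster first, so no job is flagged when part of the cluster is not visible to the scheduler.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption, not an openQA function): flags all jobs
# of a cluster, or none of them if the cluster is only partially visible.
sub flag_cluster_one_host_only {
    my ($cluster_jobs, $scheduled_jobs) = @_;

    # First pass: return a negative result if any job of the cluster is
    # missing from the scheduled jobs, e.g. because it is still blocked
    # by a Minion job. Nothing has been modified at this point.
    for my $job_id (keys %$cluster_jobs) {
        return 0 unless $scheduled_jobs->{$job_id};
    }

    # Second pass: the cluster is complete, so flag every job.
    $scheduled_jobs->{$_}->{one_host_only_via_worker} = 1 for keys %$cluster_jobs;
    return 1;
}

# Complete cluster: jobs 1 and 2 are both known to the scheduler.
my %scheduled_a = (1 => {}, 2 => {});
my $complete_ok = flag_cluster_one_host_only({1 => 1, 2 => 1}, \%scheduled_a);

# Partial cluster: job 3 is not visible yet, so job 2 must stay untouched.
my %scheduled_b = (2 => {});
my $partial_ok = flag_cluster_one_host_only({2 => 1, 3 => 1}, \%scheduled_b);
```

The caller would then skip the cluster for the current tick whenever the helper returns a false value and retry on the next tick.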

 ## Acceptance criteria 
 * **AC1**: The scheduler handles `PARALLEL_ONE_HOST_ONLY=1` correctly if some jobs of the cluster are still blocked by Minion jobs. It might simply detect the situation and not consider the cluster at all in the current tick. 

 ## Suggestions 
 * Try to reproduce the issue first, e.g. modify the state of the database manually to have a cluster where some openQA jobs are blocked by Minion jobs and some are not. Maybe this can be done by extending an existing unit test. 
 * Check whether deleting rows for gru dependencies¹ is actually a non-atomic operation at the [default isolation level](https://www.postgresql.org/docs/current/transaction-iso.html) without an explicit transaction. 
 * See https://github.com/os-autoinst/openQA/pull/6045 for further reference. 

 --- 

 ¹ referring to the deletion in `lib/OpenQA/Shared/GruJob.pm`: 

 ``` 
 sub _delete_gru { 
     my ($self, $id) = @_; 
     my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id); 
     $gru->delete() if $gru; 
 } 

 sub _fail_gru { 
     my ($self, $id, $reason) = @_; 
     my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id); 
     $gru->fail($reason) if $gru; 
 } 
 ```
