action #169342

Updated by mkittler about 1 month ago

## Observation 
 When enabling the `git_auto_update = 'yes'` feature in production, we noticed that parallel jobs with the `PARALLEL_ONE_HOST_ONLY=1` setting were scheduled across different hosts, which should not have happened. All workers definitely had this setting and the dependencies in the cluster were definitely correct. 

 The theory is: openQA jobs are not considered at all by the scheduler as long as they are blocked by a Minion job. So the code evaluating `PARALLEL_ONE_HOST_ONLY=1` might not see the full cluster and thus assign only part of it. Later the cluster is repaired, but at that point `PARALLEL_ONE_HOST_ONLY=1` is no longer considered correctly. This theory implies that deleting rows for gru dependencies¹ is a non-atomic operation. 

 If the theory is correct, then the problematic part in the scheduler is this `next unless …` line: 

 ``` 
     for my $job_id (keys %$cluster_jobs) { 
         next unless my $cluster_job = $scheduled_jobs->{$job_id}; 
         $cluster_job->{one_host_only_via_worker} = 1; 
     } 
 ``` 

 It could be used as a starting point for an implementation, though. We could return from the entire function here with a negative result and skip the whole cluster for this scheduling tick. 
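
The early return suggested above could look roughly like the following. This is only a minimal sketch and not the actual scheduler code: the function name `mark_cluster_one_host_only` and its return convention are hypothetical, and it assumes `$cluster_jobs` and `$scheduled_jobs` are hash references keyed by job ID as in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of openQA): marks all jobs of a cluster as
# "one host only", but only if the scheduler currently sees every job of the
# cluster. Returns 1 if the cluster was marked and 0 if it should be skipped.
sub mark_cluster_one_host_only {
    my ($cluster_jobs, $scheduled_jobs) = @_;

    # First pass: give up before modifying anything if a job of the cluster
    # is not visible, e.g. because it is still blocked by a Minion job.
    for my $job_id (keys %$cluster_jobs) {
        return 0 unless $scheduled_jobs->{$job_id};
    }

    # Second pass: the whole cluster is visible, so it is safe to mark it.
    for my $job_id (keys %$cluster_jobs) {
        $scheduled_jobs->{$job_id}->{one_host_only_via_worker} = 1;
    }
    return 1;
}
```

A caller would skip the cluster for the current tick whenever the function returns 0 and retry on the next tick, once all Minion jobs blocking the cluster are gone. Checking the whole cluster before marking any job avoids leaving a partially marked cluster behind.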

 ## Acceptance criteria 
 * **AC1**: The scheduler handles `PARALLEL_ONE_HOST_ONLY=1` correctly if some jobs of the cluster are still blocked by Minion jobs or chained parents. It might simply detect the situation and not consider the cluster at all in the current tick. 
 * **AC2**: Follow-up tickets are created for further issues we have found (non-transactional creation of Minion jobs when restarting jobs, repairing half-scheduled parallel clusters when `PARALLEL_ONE_HOST_ONLY=1` is used). 

 ## Suggestions 
 * Try to reproduce the issue first, e.g. modify the state of the database manually to have a cluster where some openQA jobs are blocked by Minion jobs and some are not. Maybe this can be done by extending an existing unit test. 
 * Check whether deleting rows for gru dependencies¹ is actually a non-atomic operation at the [default isolation level](https://www.postgresql.org/docs/current/transaction-iso.html) without an explicit transaction. 
 * See https://github.com/os-autoinst/openQA/pull/6045 for further reference. 

 --- 

 ¹ referring to the deletion in `lib/OpenQA/Shared/GruJob.pm`: 

 ``` 
 sub _delete_gru { 
     my ($self, $id) = @_; 
     my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id); 
     $gru->delete() if $gru; 
 } 

 sub _fail_gru { 
     my ($self, $id, $reason) = @_; 
     my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id); 
     $gru->fail($reason) if $gru; 
 } 
 ```
