action #169342
Updated by mkittler about 1 month ago
## Observation
When enabling the `git_auto_update = 'yes'` feature in production, we noticed that parallel jobs with the `PARALLEL_ONE_HOST_ONLY=1` setting were scheduled across different hosts, which should not have happened. All workers definitely had this setting and the dependencies in the cluster were definitely correct.
The theory is: openQA jobs are not considered at all by the scheduler as long as they are blocked by a Minion job. So the code evaluating `PARALLEL_ONE_HOST_ONLY=1` might not see the full cluster and thus assign only part of it. Later the cluster is going to be repaired, but then `PARALLEL_ONE_HOST_ONLY=1` is no longer considered correctly. This theory implies that deleting rows for gru dependencies¹ is a non-atomic operation, i.e. the scheduler can observe a state where only some jobs of the cluster are unblocked.
If the theory is correct, then the problematic part in the scheduler is this `next unless …` line:
```
for my $job_id (keys %$cluster_jobs) {
    next unless my $cluster_job = $scheduled_jobs->{$job_id};
    $cluster_job->{one_host_only_via_worker} = 1;
}
```
This line could be used as a starting point for an implementation, though. We could return from the entire function here with a negative result and skip the whole cluster for this scheduling tick.
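A minimal sketch of that idea, using plain hashes instead of the real scheduler state. The helper name `mark_cluster_one_host_only`, its return values and the sample data are made up for illustration and are not openQA's actual API; only the names `$cluster_jobs`, `$scheduled_jobs` and `one_host_only_via_worker` come from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper, not openQA's actual API: marks every job of a cluster
# as "one host only", but only if the scheduler can see the whole cluster.
# Returns 1 on success and 0 if at least one job is missing, e.g. because it
# is still blocked by a Minion job.
sub mark_cluster_one_host_only {
    my ($cluster_jobs, $scheduled_jobs) = @_;

    # first pass: refuse to touch a partially visible cluster
    for my $job_id (keys %$cluster_jobs) {
        return 0 unless $scheduled_jobs->{$job_id};
    }

    # second pass: the cluster is complete, so mark all of its jobs
    $scheduled_jobs->{$_}->{one_host_only_via_worker} = 1 for keys %$cluster_jobs;
    return 1;
}

# made-up data: a cluster of jobs 1 and 2 where job 2 is not visible yet
my %cluster   = (1 => {}, 2 => {});
my %scheduled = (1 => {id => 1});

my $partial            = mark_cluster_one_host_only(\%cluster, \%scheduled);
my $flag_after_partial = exists $scheduled{1}{one_host_only_via_worker} ? 1 : 0;

# next scheduling tick: job 2 has become visible as well
$scheduled{2} = {id => 2};
my $complete = mark_cluster_one_host_only(\%cluster, \%scheduled);
```

With this approach the caller would skip the cluster whenever the helper returns 0 and retry on the next tick. The two-pass structure avoids marking job 1 before noticing that job 2 is missing.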
## Acceptance criteria
* **AC1**: The scheduler handles `PARALLEL_ONE_HOST_ONLY=1` correctly if some jobs of the cluster are still blocked by Minion jobs. It might simply detect the situation and not consider the cluster at all in the current tick.
## Suggestions
* Try to reproduce the issue first, e.g. modify the state of the database manually to have a cluster where some openQA jobs are blocked by Minion jobs and some are not. Maybe this can be done by extending an existing unit test.
* Check whether deleting rows for gru dependencies¹ is actually a non-atomic operation at the [default isolation level](https://www.postgresql.org/docs/current/transaction-iso.html) without an explicit transaction.
* See https://github.com/os-autoinst/openQA/pull/6045 for further reference.
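The state to reproduce can be illustrated without a database. The following sketch uses an in-memory list as a simplified stand-in for the gru dependency rows; the structure, the IDs and the helper names `visible_jobs` and `delete_gru` are assumptions for illustration only. It also assumes that each job of the cluster is blocked by its own Minion job, which the ticket does not confirm. Under that assumption a half-visible cluster can occur even if each individual deletion is atomic.

```perl
use strict;
use warnings;

# Simplified stand-in for the database, not the real schema: each row links
# an openQA job to the Minion (Gru) task that still blocks it.
my @gru_dependencies = ({job_id => 1, gru_task_id => 10}, {job_id => 2, gru_task_id => 11});
my @cluster = (1, 2);

# Returns the IDs of the cluster's jobs the scheduler would consider, i.e.
# the jobs without any remaining dependency row.
sub visible_jobs {
    my %blocked = map { $_->{job_id} => 1 } @gru_dependencies;
    return grep { !$blocked{$_} } @cluster;
}

# Emulates one Gru task finishing: only the rows of that task are removed.
sub delete_gru {
    my ($gru_task_id) = @_;
    @gru_dependencies = grep { $_->{gru_task_id} != $gru_task_id } @gru_dependencies;
}

my @before = visible_jobs();    # both jobs are still blocked
delete_gru(10);
my @between = visible_jobs();   # a scheduler tick here sees only job 1
delete_gru(11);
my @after = visible_jobs();     # now the whole cluster is visible
```

A unit test could set up the equivalent of the `@between` state in the test database and then check that the scheduler does not assign job 1 on its own.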
---
¹ referring to the deletion in `lib/OpenQA/Shared/GruJob.pm`:
```
sub _delete_gru {
    my ($self, $id) = @_;
    my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
    $gru->delete() if $gru;
}

sub _fail_gru {
    my ($self, $id, $reason) = @_;
    my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
    $gru->fail($reason) if $gru;
}
```