action #169342


coordination #58184: [saga][epic][use case] full version control awareness within openQA

coordination #152847: [epic] version control awareness within openQA for test distributions

Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature size:M

Added by mkittler about 1 month ago. Updated 26 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-11-05
Due date:
% Done: 0%
Estimated time:
Description

Observation

When enabling the git_auto_update = 'yes' feature in production we noticed that parallel jobs with the PARALLEL_ONE_HOST_ONLY=1 setting were scheduled across different hosts, which should not have happened. All workers definitely had this setting and the dependencies in the cluster were definitely correct.

The theory is: openQA jobs are not considered at all by the scheduler as long as they are blocked by a Minion job. So the code evaluating PARALLEL_ONE_HOST_ONLY=1 might not see the full cluster and thus assign only a part of it. Later the cluster is going to be repaired, but then PARALLEL_ONE_HOST_ONLY=1 is not considered correctly anymore. This theory implies that deleting rows for gru dependencies¹ is a non-atomic operation.

If the theory is correct, then the problematic part in the scheduler is this next unless … line:

    for my $job_id (keys %$cluster_jobs) {
        next unless my $cluster_job = $scheduled_jobs->{$job_id};
        $cluster_job->{one_host_only_via_worker} = 1;
    }

It could be used as a starting point for an implementation, though. We could return from the entire function here with a negative result and skip the whole cluster for this scheduling tick.
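
A minimal sketch of that idea, assuming the surrounding function can signal a negative result that makes the scheduler skip the cluster for the current tick (the return convention is an assumption, not the actual implementation):

    for my $job_id (keys %$cluster_jobs) {
        my $cluster_job = $scheduled_jobs->{$job_id};
        # a cluster job missing from $scheduled_jobs is still blocked, e.g. by a
        # Minion job or a chained parent, so PARALLEL_ONE_HOST_ONLY=1 cannot be
        # evaluated for the full cluster yet: give up on the whole cluster for
        # this scheduling tick instead of half-assigning it
        return 0 unless defined $cluster_job;
        $cluster_job->{one_host_only_via_worker} = 1;
    }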

Acceptance criteria

  • AC1: The scheduler handles PARALLEL_ONE_HOST_ONLY=1 correctly if some jobs of the cluster are still blocked by Minion jobs or chained parents. It might simply detect the situation and not consider the cluster at all in the current tick.
  • AC2: Follow-up tickets are created for further issues we have found (non-transactional creation of Minion jobs when restarting jobs, repairing half-scheduled parallel clusters when PARALLEL_ONE_HOST_ONLY=1 is used)

Suggestions

  • Try to reproduce the issue first, e.g. modify the state of the database manually to have a cluster where some openQA jobs are blocked by Minion jobs and some are not (see the sketch after this list). Maybe this can be done by extending an existing unit test.
  • Check whether deleting rows for gru dependencies¹ is actually a non-atomic operation at the default isolation level without an explicit transaction.
  • See https://github.com/os-autoinst/openQA/pull/6045 for further reference.
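
A rough sketch of how such a database state could be set up from a test, assuming direct access to the schema; the resultset name GruDependencies, the task arguments and the priority value are assumptions based on the schema details quoted elsewhere in this ticket:

    # assumes: use DateTime; $schema and $blocked_job come from the test setup
    # make one job of the parallel cluster depend on a Gru task so it is
    # blocked while its cluster partners are free to be scheduled
    my $gru_task = $schema->resultset('GruTasks')->create({
        taskname => 'git_clone',
        args     => [],
        run_at   => DateTime->now,
        priority => 10,
    });
    $schema->resultset('GruDependencies')->create({
        job_id      => $blocked_job->id,
        gru_task_id => $gru_task->id,
    });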

¹ referring to the deletion in lib/OpenQA/Shared/GruJob.pm:

sub _delete_gru {
    my ($self, $id) = @_;
    my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
    $gru->delete() if $gru;
}

sub _fail_gru {
    my ($self, $id, $reason) = @_;
    my $gru = $self->minion->app->schema->resultset('GruTasks')->find($id);
    $gru->fail($reason) if $gru;
}

Related issues 4 (4 open, 0 closed)

Related to openQA Project (public) - action #169510: Improve non-transactional creation of Minion jobs for Git updates when restarting jobs size:M (Workable, 2024-11-07)

Related to openQA Project (public) - action #169513: Improve/investigate behavior when repairing half-scheduled parallel clusters when `PARALLEL_ONE_HOST_ONLY=1` is used (New, 2024-11-07)

Blocks openQA Project (public) - action #168379: Enable automatic openQA git clone by default size:S (Blocked, mkittler, 2024-10-17)

Blocks openQA Infrastructure (public) - action #168376: Enable automatic openQA git clone instead of fetchneedles on OSD size:S (Blocked, mkittler)
Actions #1

Updated by mkittler about 1 month ago

  • Blocks action #168379: Enable automatic openQA git clone by default size:S added
Actions #2

Updated by mkittler about 1 month ago

  • Blocks action #168376: Enable automatic openQA git clone instead of fetchneedles on OSD size:S added
Actions #3

Updated by okurz about 1 month ago

  • Target version set to Ready
  • Parent task set to #152847
Actions #5

Updated by tinita about 1 month ago

Here is the corresponding scheduler log for those jobs (added empty lines between presumably different scheduler runs):

% xzgrep 1580222[6789] openqa_scheduler.6.xz                                                                                                                                                
[2024-10-29T16:06:57.558769Z] [debug] [pid:16545] Need to schedule 3 parallel jobs for job 15802226 (with priority 60)
[2024-10-29T16:06:57.742255Z] [debug] [pid:16545] Assigned job '15802227' to worker ID '2769'
[2024-10-29T16:06:57.746891Z] [debug] [pid:16545] [Job#15802227] Prepare for being processed by worker 2769
[2024-10-29T16:06:57.801324Z] [debug] [pid:16545] Created network for 15802227: 113
[2024-10-29T16:06:57.837729Z] [debug] [pid:16545] Sent job(s) '15802227' to worker '2769'
[2024-10-29T16:06:57.976717Z] [debug] [pid:16545] Assigned job '15802226' to worker ID '2800'
[2024-10-29T16:06:57.983597Z] [debug] [pid:16545] [Job#15802226] Prepare for being processed by worker 2800
[2024-10-29T16:06:58.019375Z] [debug] [pid:16545] Sent job(s) '15802226' to worker '2800'
[2024-10-29T16:06:58.030805Z] [debug] [pid:16545] Assigned job '15802228' to worker ID '3924'
[2024-10-29T16:06:58.032524Z] [debug] [pid:16545] [Job#15802228] Prepare for being processed by worker 3924
[2024-10-29T16:06:58.065485Z] [debug] [pid:16545] Sent job(s) '15802228' to worker '3924'
[2024-10-29T16:06:58.145095Z] [debug] [pid:16545] Allocated: { job => 15802227, worker => 2769 }
[2024-10-29T16:06:58.145397Z] [debug] [pid:16545] Allocated: { job => 15802226, worker => 2800 }
[2024-10-29T16:06:58.145465Z] [debug] [pid:16545] Allocated: { job => 15802228, worker => 3924 }

[2024-10-29T16:08:30.802120Z] [debug] [pid:16545] Need to schedule 3 parallel jobs for job 15802226 (with priority 60)
[2024-10-29T16:08:30.951727Z] [debug] [pid:16545] Assigned job '15802228' to worker ID '3924'
[2024-10-29T16:08:30.953182Z] [debug] [pid:16545] [Job#15802228] Prepare for being processed by worker 3924
[2024-10-29T16:08:30.995026Z] [debug] [pid:16545] Sent job(s) '15802228' to worker '3924'
[2024-10-29T16:08:31.183667Z] [debug] [pid:16545] Assigned job '15802229' to worker ID '2796'
[2024-10-29T16:08:31.186870Z] [debug] [pid:16545] [Job#15802229] Prepare for being processed by worker 2796
[2024-10-29T16:08:31.226278Z] [debug] [pid:16545] Sent job(s) '15802229' to worker '2796'
[2024-10-29T16:08:31.583825Z] [debug] [pid:16545] Assigned job '15802226' to worker ID '2800'
[2024-10-29T16:08:31.585509Z] [debug] [pid:16545] [Job#15802226] Prepare for being processed by worker 2800
[2024-10-29T16:08:31.630903Z] [debug] [pid:16545] Sent job(s) '15802226' to worker '2800'
[2024-10-29T16:08:31.631547Z] [debug] [pid:16545] Allocated: { job => 15802228, worker => 3924 }
[2024-10-29T16:08:31.631891Z] [debug] [pid:16545] Allocated: { job => 15802229, worker => 2796 }
[2024-10-29T16:08:31.632412Z] [debug] [pid:16545] Allocated: { job => 15802226, worker => 2800 }

[2024-10-29T16:08:38.647418Z] [debug] [pid:16545] Need to schedule 1 parallel jobs for job 15802226 (with priority 60)
[2024-10-29T16:08:38.972736Z] [debug] [pid:16545] Assigned job '15802226' to worker ID '3555'
[2024-10-29T16:08:38.974817Z] [debug] [pid:16545] [Job#15802226] Prepare for being processed by worker 3555
[2024-10-29T16:08:39.016155Z] [debug] [pid:16545] Sent job(s) '15802226' to worker '3555'
[2024-10-29T16:08:39.254325Z] [debug] [pid:16545] Allocated: { job => 15802226, worker => 3555 }

Here you can see that 15802226 was first assigned to worker 2800 (worker36), but later to 3555 (worker40), and for some reason the three other jobs are not part of its dependencies anymore (Need to schedule 1 parallel jobs for job 15802226).

Actions #6

Updated by tinita about 1 month ago

openqa=> select id, t_created, t_started, t_finished, t_updated, state, result, assigned_worker_id from jobs where id in (15802226, 15802227, 15802228, 15802229);
    id    |      t_created      |      t_started      |     t_finished      |      t_updated      | state |     result      | assigned_worker_id 
----------+---------------------+---------------------+---------------------+---------------------+-------+-----------------+--------------------
 15802226 | 2024-10-29 16:06:54 | 2024-10-29 16:08:39 | 2024-10-29 16:21:45 | 2024-10-29 16:21:45 | done  | parallel_failed |               3555
 15802227 | 2024-10-29 16:06:54 | 2024-10-29 16:06:57 | 2024-10-29 16:21:44 | 2024-10-29 16:21:44 | done  | parallel_failed |               2769
 15802228 | 2024-10-29 16:06:54 | 2024-10-29 16:08:31 | 2024-10-29 16:21:43 | 2024-10-29 16:21:43 | done  | failed          |               3924
 15802229 | 2024-10-29 16:06:54 | 2024-10-29 16:08:31 | 2024-10-29 16:21:46 | 2024-10-29 16:21:46 | done  | parallel_failed |               2796
(4 rows)

As you can see here, the t_started times correspond to the 3 scheduler runs. 15802227 was started first at 16:06:57, then 15802228 and 15802229 were started at 16:08:31, and finally 15802226 at 16:08:39.

Actions #7

Updated by tinita about 1 month ago

What I find weird is that in the first scheduler run it sees and assigns 15802226, 15802227 and 15802228, but in the next run it sees and assigns 15802226, 15802228, 15802229, and then in the third run it sees only 15802226.
So if the scheduler assigns 15802226 in the first run, why is it still unassigned in the next run?

Actions #8

Updated by mkittler about 1 month ago

  • Description updated (diff)
Actions #9

Updated by tinita about 1 month ago · Edited

I looked into the log of worker36 id 2800 instance 34:

% journalctl -u openqa-worker-auto-restart@34.service --since '2024-10-29 16:05:00'
...
Oct 29 16:05:58 worker36 worker[30270]: [warn] [pid:30270] The average load (28.46 28.84 28.10) is exceeding the configured threshold of 25.
Oct 29 16:06:18 worker36 worker[30270]: [debug] [pid:30270] Refusing to grab job from openqa.suse.de: The average load (28.46 28.84 28.10) is exceeding the configured threshold of 25.
Oct 29 16:06:58 worker36 worker[30270]: [debug] [pid:30270] Refusing to grab job from openqa.suse.de: The average load (28.46 28.84 28.10) is exceeding the configured threshold of 25.
Oct 29 16:08:31 worker36 worker[30270]: [debug] [pid:30270] Refusing to grab job from openqa.suse.de: The average load (28.46 28.84 28.10) is exceeding the configured threshold of 25.
Oct 29 16:08:38 worker36 worker[30270]: [debug] [pid:30270] Refusing to grab job from openqa.suse.de: The average load (28.46 28.84 28.10) is exceeding the configured threshold of 25.

So that could explain why it was allocated but then refused, and then allocated again later. But if a worker refuses to take a job which is part of a cluster, then that's problematic in any case, I guess.

Actions #10

Updated by tinita about 1 month ago · Edited

Meanwhile on instance 3 (id 2769):

% journalctl -u openqa-worker-auto-restart@3.service --since '2024-10-29 16:06:50'
Oct 29 16:06:57 worker36 worker[30150]: [debug] [pid:30150] Accepting job 15802227 from openqa.suse.de.
Oct 29 16:06:57 worker36 worker[30150]: [debug] [pid:30150] Setting job 15802227 from openqa.suse.de up

So 2800 refused a job (15802226) while 2769 accepted a job (15802227).
The cluster is already potentially broken at that point.
That's in addition to the problem that the scheduler only found 3 instead of 4 parallel jobs.

edit: for completeness id 3924

% journalctl -u openqa-worker-auto-restart@41.service --since '2024-10-29 16:06:00'
Oct 29 16:06:18 worker36 worker[30037]: [debug] [pid:30037] Refusing to grab job from openqa.suse.de: The average load (31.26 28.65 27.95) is exceeding the configured threshold of 25.
Oct 29 16:06:58 worker36 worker[30037]: [debug] [pid:30037] Refusing to grab job from openqa.suse.de: The average load (31.26 28.65 27.95) is exceeding the configured threshold of 25.
Oct 29 16:08:30 worker36 worker[30037]: [debug] [pid:30037] Accepting job 15802228 from openqa.suse.de.

So 15802228 was also refused in the first run, but accepted in the second.

Actions #11

Updated by mkittler about 1 month ago

This is indeed another problem. So PARALLEL_ONE_HOST_ONLY=1 does not play well with workers rejecting jobs after an assignment. The code for repairing half-scheduled clusters takes PARALLEL_ONE_HOST_ONLY=1 into account if I remember correctly, but it might have issues/limitations and run into the problem that it cannot find a suitable worker slot.

Actions #12

Updated by tinita about 1 month ago · Edited

In case of the high load: how about instead of refusing to take a job, the worker just doesn't start it yet and checks for a reduced load periodically until it can start the job?
At least for cluster jobs which need to be on that specific worker host.

Actions #13

Updated by tinita about 1 month ago · Edited

I just checked another example:

% xzgrep 1580759[3456] openqa_scheduler.6.xz
[2024-10-30T01:34:17.561117Z] [debug] [pid:16545] Need to schedule 3 parallel jobs for job 15807593 (with priority 60)
[2024-10-30T01:34:17.716628Z] [debug] [pid:16545] Assigned job '15807594' to worker ID '2737'
[2024-10-30T01:34:17.724110Z] [debug] [pid:16545] [Job#15807594] Prepare for being processed by worker 2737
[2024-10-30T01:34:17.784315Z] [debug] [pid:16545] Created network for 15807594: 3
[2024-10-30T01:34:17.819122Z] [debug] [pid:16545] Sent job(s) '15807594' to worker '2737'
[2024-10-30T01:34:17.892435Z] [debug] [pid:16545] Assigned job '15807596' to worker ID '2734'
[2024-10-30T01:34:17.896245Z] [debug] [pid:16545] [Job#15807596] Prepare for being processed by worker 2734
[2024-10-30T01:34:17.934917Z] [debug] [pid:16545] Sent job(s) '15807596' to worker '2734'
[2024-10-30T01:34:18.178278Z] [debug] [pid:16545] Assigned job '15807593' to worker ID '3950'
[2024-10-30T01:34:18.180458Z] [debug] [pid:16545] [Job#15807593] Prepare for being processed by worker 3950
[2024-10-30T01:34:18.217118Z] [debug] [pid:16545] Sent job(s) '15807593' to worker '3950'
[2024-10-30T01:34:18.286538Z] [debug] [pid:16545] Allocated: { job => 15807594, worker => 2737 }
[2024-10-30T01:34:18.287174Z] [debug] [pid:16545] Allocated: { job => 15807596, worker => 2734 }
[2024-10-30T01:34:18.287800Z] [debug] [pid:16545] Allocated: { job => 15807593, worker => 3950 }

[2024-10-30T01:34:57.146687Z] [debug] [pid:16545] Need to schedule 1 parallel jobs for job 15807595 (with priority 60)
[2024-10-30T01:34:57.572463Z] [debug] [pid:16545] Assigned job '15807595' to worker ID '3883'
[2024-10-30T01:34:57.584557Z] [debug] [pid:16545] [Job#15807595] Prepare for being processed by worker 3883
[2024-10-30T01:34:57.660666Z] [debug] [pid:16545] Sent job(s) '15807595' to worker '3883'
[2024-10-30T01:34:57.857248Z] [debug] [pid:16545] Allocated: { job => 15807595, worker => 3883 }

In this case it seems there was no job refusal: 3 jobs were assigned in the first run (worker35), one in the second (worker32). I also didn't see refusal messages in the worker logs.

Actions #14

Updated by mkittler about 1 month ago · Edited

  • Status changed from New to In Progress
  • Assignee set to mkittler

In case of the high load: how about instead of refusing to take a job, the worker just doesn't start it yet and checks for a reduced load periodically until it can start the job?
At least for cluster jobs which need to be on that specific worker host.

I guess that would be possible but would require quite some changes on the worker. So far we simply use the worker's self-check mechanism, but this would need to be extended or implemented entirely differently.


Considering a worker might also reject jobs for other reasons, I suppose we should just be able to deal with that. And hopefully those cases are rare. Normally, if the worker is broken the scheduler doesn't consider it at all. Here we are in the situation that the worker seemed available, but in the time it took for the scheduler to do the assignment it turned unavailable. I suppose that is not a situation we're facing too often.

The starting point for this was the theory about Minion jobs so I'll first see whether I can find something out about the behavior of the cascade deletion. Then I'm going to look into the behavior of the scheduler in case there is a half-scheduled cluster (e.g. after a rejection) and see whether the code for repairing those half-scheduled clusters is flawed.

Actions #15

Updated by mkittler about 1 month ago · Edited

DBIx::Class is really just doing a single delete (as we would expect), e.g.:

DBIC_TRACE=1 openqa-start cmd eval 'app->schema->resultset("GruTasks")->find(2486)->delete'
script/openqa eval app->schema->resultset("GruTasks")->find(2486)->delete
…
SELECT me.id, me.taskname, me.args, me.run_at, me.priority, me.t_created, me.t_updated FROM gru_tasks me WHERE ( me.id = ? ): '2486'
BEGIN WORK
DELETE FROM gru_tasks WHERE ( id = ? ): '2486'
SELECT me.job_id, me.gru_task_id FROM gru_dependencies me WHERE ( me.gru_task_id = ? ): '2486'
COMMIT

The deletion of the gru_dependencies rows relies on ON DELETE CASCADE:

ALTER TABLE gru_dependencies ADD CONSTRAINT gru_dependencies_fk_gru_task_id FOREIGN KEY (gru_task_id)
  REFERENCES gru_tasks (id) ON DELETE CASCADE ON UPDATE CASCADE DEFERRABLE;

So I'm pretty sure that it is impossible to observe a half-deleted set of gru_dependencies. This would be a "dirty read" and it is not possible in PostgreSQL with any isolation level (see https://www.postgresql.org/docs/current/transaction-iso.html).

We could actually improve the code to use search, avoiding the extra SELECT and the condition in Perl.
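
A minimal sketch of that improvement (an illustration only, assuming the rest of the module stays as quoted above):

    sub _delete_gru {
        my ($self, $id) = @_;
        # delete via the resultset so only a single DELETE statement is issued;
        # the extra SELECT and the "if $gru" check in Perl are no longer needed
        # and removing the gru_dependencies rows is left to ON DELETE CASCADE
        $self->minion->app->schema->resultset('GruTasks')->search({id => $id})->delete;
    }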


I have also checked the code of the git_auto_update feature and initially concluded that it is very unlikely to be the culprit because it never creates Minion jobs associated with openQA jobs. Only the fact that it creates more Minion jobs in general, which might delay other git_clone tasks, could have an impact, and that delay is very unlikely to affect the relevant openQA jobs because those don't depend on git_clone tasks at all. However, this conclusion is wrong: git_auto_update can lead to the creation of Minion jobs associated with openQA jobs because it is checked in create_git_clone_list.

Actions #16

Updated by mkittler about 1 month ago

Alternative theory: This is mainly about clones.

The cluster mentioned in #note-5 (https://openqa.suse.de/tests/15802227) can be explained by the fact that those jobs are clones and we invoke OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids) if keys %clones; outside of transactions when restarting jobs. This is clearly wrong and needs to be fixed. Otherwise the scheduler will never be able to do the right thing, as for a moment it will simply look like the cluster has no Minion jobs to block on at all. The same goes for the jobs mentioned in #note-4. (Those jobs were initially skipped/restarted because the chained parent failed and was then restarted.)
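
A rough sketch of the direction such a fix could take, assuming a $schema handle is available in the restart code and that the clone creation and the enqueuing can share one transaction (an illustration, not the actual change from the PR):

    # assumes a $schema handle for the openQA database is in scope
    $schema->txn_do(sub {
        # ... create the restarted/cloned jobs here, collecting %clones and @clone_ids ...
        # enqueue the git_clone Minion jobs within the same transaction so the
        # scheduler can never observe the restarted cluster without the Gru
        # tasks blocking it
        OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids) if keys %clones;
    });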

All the other jobs linked in #note-13 are just clones as well. Here the initial cluster failed as well, which is why it seemed that the fact that these are clones doesn't matter. However, I checked all initial clusters and they failed for a different reason, not because jobs were scheduled on different hosts. So only the clones were affected.

Actions #17

Updated by tinita about 1 month ago · Edited

It might be too complicated to move the enqueue_git_clones into the transaction. It seems to me that for job_restart, every job is created in its own transaction. And we don't want to have a minion job for every openQA job.

Maybe it would help to add another job state, e.g. preparing, and only set the job to scheduled after the Minion jobs have been created.
Another workaround would be to have a dummy GruTask prepare that every job gets assigned to first; then, after the git_clone Minion job has been created, we remove the jobs from the dummy task.

Actions #18

Updated by mkittler about 1 month ago

PR for fixing the restart code: https://github.com/os-autoinst/openQA/pull/6048


This still leaves:

  1. Improving the scheduler because there are also download tasks and those might differ between jobs in a parallel cluster.
  2. Checking the code for repairing half-scheduled clusters so it handles the case of job rejection.
Actions #19

Updated by mkittler about 1 month ago

And we don't want to have a minion job for every openQA job.

That's indeed not ideal and my PR would of course lead to that. I was hoping that with your changes to de-duplicate Minion jobs when enqueuing them this wouldn't be a big deal, though.

Actions #20

Updated by mkittler about 1 month ago

Actions #21

Updated by openqa_review about 1 month ago

  • Due date set to 2024-11-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #22

Updated by mkittler about 1 month ago

  • Subject changed from Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature to Fix scheduling parallel clusters with `PARALLEL_ONE_HOST_ONLY=1` when the openQA jobs depend on Minion jobs e.g. `git_clone` tasks started for the `git_auto_update` feature size:M
  • Description updated (diff)
Actions #23

Updated by mkittler about 1 month ago

  • Related to action #169510: Improve non-transactional creation of Minion jobs for Git updates when restarting jobs size:M added
Actions #24

Updated by mkittler about 1 month ago

  • Related to action #169513: Improve/investigate behavior when repairing half-scheduled parallel clusters when `PARALLEL_ONE_HOST_ONLY=1` is used added
Actions #25

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

I created follow-up tickets (see related tickets).

Actions #26

Updated by tinita about 1 month ago

We got another occurrence, this time as a scheduled product.
According to the logs, this time it's solely because of job refusal.

grep -E '15876167|15876289|15876062|15876303' openqa_scheduler
[2024-11-07T11:42:02.870044Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876062 (with priority 60)
[2024-11-07T11:42:02.879466Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876167 (with priority 60)
[2024-11-07T11:42:02.881908Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876289 (with priority 60)
[2024-11-07T11:42:02.882309Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876303 (with priority 60)

[2024-11-07T11:43:32.957612Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876062 (with priority 60)
[2024-11-07T11:43:32.984504Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876167 (with priority 60)
[2024-11-07T11:43:32.989749Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876289 (with priority 60)
[2024-11-07T11:43:32.991098Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876303 (with priority 60)

[2024-11-07T11:46:32.711542Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876062 (with priority 60)
[2024-11-07T11:46:32.743506Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876167 (with priority 60)
[2024-11-07T11:46:32.755792Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876289 (with priority 60)
[2024-11-07T11:46:32.757076Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876303 (with priority 60)

[2024-11-07T11:46:39.372563Z] [debug] [pid:7559] Need to schedule 4 parallel jobs for job 15876062 (with priority 60)
[2024-11-07T11:46:39.388877Z] [debug] [pid:7559] Assigned job '15876167' to worker ID '3940'
[2024-11-07T11:46:39.390943Z] [debug] [pid:7559] [Job#15876167] Prepare for being processed by worker 3940
[2024-11-07T11:46:39.436623Z] [debug] [pid:7559] Created network for 15876167: 60
[2024-11-07T11:46:39.480393Z] [debug] [pid:7559] Sent job(s) '15876167' to worker '3940'
[2024-11-07T11:46:39.489895Z] [debug] [pid:7559] Assigned job '15876062' to worker ID '2773'
[2024-11-07T11:46:39.493493Z] [debug] [pid:7559] [Job#15876062] Prepare for being processed by worker 2773
[2024-11-07T11:46:39.526091Z] [debug] [pid:7559] Sent job(s) '15876062' to worker '2773'
[2024-11-07T11:46:39.537092Z] [debug] [pid:7559] Assigned job '15876303' to worker ID '3987'
[2024-11-07T11:46:39.538930Z] [debug] [pid:7559] [Job#15876303] Prepare for being processed by worker 3987
[2024-11-07T11:46:39.580447Z] [debug] [pid:7559] Sent job(s) '15876303' to worker '3987'
[2024-11-07T11:46:39.597380Z] [debug] [pid:7559] Assigned job '15876289' to worker ID '2792'
[2024-11-07T11:46:39.600071Z] [debug] [pid:7559] [Job#15876289] Prepare for being processed by worker 2792
[2024-11-07T11:46:39.638486Z] [debug] [pid:7559] Sent job(s) '15876289' to worker '2792'
[2024-11-07T11:46:39.638966Z] [debug] [pid:7559] Allocated: { job => 15876167, worker => 3940 }
[2024-11-07T11:46:39.639058Z] [debug] [pid:7559] Allocated: { job => 15876062, worker => 2773 }
[2024-11-07T11:46:39.639138Z] [debug] [pid:7559] Allocated: { job => 15876303, worker => 3987 }
[2024-11-07T11:46:39.639212Z] [debug] [pid:7559] Allocated: { job => 15876289, worker => 2792 }

[2024-11-07T11:46:54.281977Z] [debug] [pid:7559] Need to schedule 2 parallel jobs for job 15876062 (with priority 60)
[2024-11-07T11:46:54.437672Z] [debug] [pid:7559] Assigned job '15876303' to worker ID '3919'
[2024-11-07T11:46:54.439224Z] [debug] [pid:7559] [Job#15876303] Prepare for being processed by worker 3919
[2024-11-07T11:46:54.478547Z] [debug] [pid:7559] Sent job(s) '15876303' to worker '3919'
[2024-11-07T11:46:54.549376Z] [debug] [pid:7559] Assigned job '15876062' to worker ID '2706'
[2024-11-07T11:46:54.550789Z] [debug] [pid:7559] [Job#15876062] Prepare for being processed by worker 2706
[2024-11-07T11:46:54.585323Z] [debug] [pid:7559] Sent job(s) '15876062' to worker '2706'
[2024-11-07T11:46:54.669870Z] [debug] [pid:7559] Allocated: { job => 15876303, worker => 3919 }
[2024-11-07T11:46:54.670019Z] [debug] [pid:7559] Allocated: { job => 15876062, worker => 2706 }
openqa=> select id, host, instance from workers where id in (3940, 2773, 3987, 2792);
  id  |   host   | instance 
------+----------+----------
 2773 | worker36 |        7
 2792 | worker36 |       26
 3940 | worker36 |       47
 3987 | worker36 |       57
(4 rows)
% journalctl -u openqa-worker-auto-restart@7.service --since '2024-11-07 11:00:00'| grep Refus
Nov 07 11:46:39 worker36 worker[29218]: [debug] [pid:29218] Refusing to grab job from openqa.suse.de: The average load (39.08 28.50 17.56) is exceeding the configured threshold of 25.
% journalctl -u openqa-worker-auto-restart@26.service --since '2024-11-07 11:00:00'| grep Refus                                                                                             
% journalctl -u openqa-worker-auto-restart@47.service --since '2024-11-07 11:00:00'| grep Refus                                                                                             
% journalctl -u openqa-worker-auto-restart@57.service --since '2024-11-07 11:00:00'| grep Refus                                                                                             
Nov 07 11:46:39 worker36 worker[29405]: [debug] [pid:29405] Refusing to grab job from openqa.suse.de: The average load (34.26 27.69 19.00) is exceeding the configured threshold of 25.
Actions #27

Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

The PR has been merged. We'll still see clusters scheduled across different hosts until we fix #169513. So there's probably no reason to wait until this change has been deployed - especially considering that this was also easily reproducible/testable locally.

At least #168376 and #168379 are now blocked by one less ticket but to move those forward we should fix #169510 first.

Actions #28

Updated by okurz 26 days ago

  • Due date deleted (2024-11-21)