action #25892: Scheduling parallel jobs - openQA Project (public) - openSUSE Project Management Tool

action #25892

closed

Scheduling parallel jobs

Added by nadvornik over 7 years ago. Updated about 7 years ago.

Status:

Resolved

Priority:

High

Assignee:

EDiGiacinto

Category:

Regressions/Crashes

Target version:

Done

Start date:

2017-10-10

Due date:

% Done:

Estimated time:

Description

After upgrading to the new scheduler, I have the following problem.
I have this group of parallel jobs:
A
B PARALLEL_WITH A
C PARALLEL_WITH A,B
D PARALLEL_WITH A,B

The jobs use barriers to synchronize so they finish approximately at the same
time.

On an openQA server with 4 workers I have this group scheduled multiple times:
A1, B1, C1, D1, A2, B2, C2, D2, A3, B3, C3, D3, ...

The first group of jobs (A1, B1, C1, D1) finishes normally, but afterwards the
workers sometimes end up running jobs A2, A3, A4, A5 in parallel, each waiting
forever for the rest of its group.

It looks like a race condition in the use of the _prefer_parallel() function.
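
The symptom can be reproduced in a toy model. The sketch below is not openQA's scheduler code: the `Job` class, the `pick` rule (a job becomes eligible once all its PARALLEL_WITH parents are running, and jobs of already-running groups are preferred) and the `fresh_view` switch are assumptions made up for illustration. It only shows how decisions taken against a stale view of the running jobs could put A2, A3, A4, A5 on four workers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    group: int
    parents: tuple = ()  # names listed in PARALLEL_WITH


def make_group(n):
    """Build the A/B/C/D group from the description for group number n."""
    a, b = f"A{n}", f"B{n}"
    return [Job(a, n), Job(b, n, (a,)),
            Job(f"C{n}", n, (a, b)), Job(f"D{n}", n, (a, b))]


def pick(scheduled, running_names, running_groups):
    """Hypothetical selection rule, not the real _prefer_parallel()."""
    eligible = [j for j in scheduled
                if all(p in running_names for p in j.parents)]
    preferred = [j for j in eligible if j.group in running_groups]
    return (preferred or eligible or [None])[0]


def allocate(scheduled, workers, fresh_view):
    """Assign one job per worker.

    fresh_view=True: each decision sees the jobs assigned so far.
    fresh_view=False: every decision sees the same stale (empty) snapshot,
    which is the assumed race.
    """
    scheduled = list(scheduled)
    running, stale_snapshot = [], []
    for _ in range(workers):
        view = running if fresh_view else stale_snapshot
        job = pick(scheduled, {j.name for j in view}, {j.group for j in view})
        if job is None:
            break
        scheduled.remove(job)
        running.append(job)
    return [j.name for j in running]


jobs = [j for n in (2, 3, 4, 5) for j in make_group(n)]
print(allocate(jobs, 4, fresh_view=True))   # one complete group
print(allocate(jobs, 4, fresh_view=False))  # four group heads, nothing can finish
```

In this model the second allocation leaves no worker free for any B, C or D job, so every group waits at its barrier.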

Files

slenkins_jobs_deadlock.png (73.3 KB)

thehejik, 2018-02-13 14:57

Related issues 1 (0 open — 1 closed)

Updated by coolo over 7 years ago

Target version set to Ready

Tricky to write a test case for, I assume.

Updated by mkravec over 7 years ago

We have issues with the scheduler in the QAM-CaaSP and QA-CaaSP tests.
QAM starts 7 clusters (8 jobs per cluster) at the same time, sometimes in addition to the QA clusters (12+5+5+5+1+1 jobs).
We have 2*24 dedicated workers (maybe a bit overloaded) handling from 50 to 85 cluster jobs at the same time.

I talked to Ettore and he proposed starting a cluster test only when there are enough free workers. I like this solution; it would allow us to share CaaSP workers with SLE again.
This way we prevent situations where we have:

5 workers in 1st cluster
7 workers in 2nd cluster
...
6 workers in 3rd cluster
but we actually need 8 workers to finish a cluster test.
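
The proposal of starting a cluster only when enough workers are free amounts to an all-or-nothing admission check. The following is a minimal sketch of that idea, not openQA code; the function name `clusters_to_start` and the `(name, size)` tuples are hypothetical.

```python
def clusters_to_start(pending_clusters, free_workers):
    """Walk the pending clusters in order and start a cluster only if
    every one of its jobs gets a worker; otherwise leave it scheduled.

    pending_clusters: list of (name, number_of_jobs) tuples (assumed shape).
    Returns the names of the started clusters and the workers left over.
    """
    started = []
    for name, size in pending_clusters:
        if size <= free_workers:
            started.append(name)
            free_workers -= size
    return started, free_workers


# 7 QAM clusters of 8 jobs each on 2*24 workers, as in the numbers above
qam = [(f"qam-{i}", 8) for i in range(1, 8)]
started, left = clusters_to_start(qam, 48)
print(started, left)
```

With these numbers six clusters start completely and the seventh stays scheduled instead of occupying a handful of workers it cannot make progress with. One trade-off of this simple form: because smaller clusters further down the list may be admitted first, a large cluster can wait a long time.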

Current QAM issues:

QA clusters (29 jobs) are more reliable at the moment because they are usually started at a different time than QAM (56 jobs).

Updated by coolo over 7 years ago

Priority changed from Normal to High

Updated by thehejik over 7 years ago

File slenkins_jobs_deadlock.png added

I have a similar problem with slenkins (which uses mutexes and PARALLEL_WITH="sutX,sutY" in the *-control job only) on my local openQA instance (fully updated today).

The problem is that workers get distributed across unrelated jobs that cannot finish unless their sibling jobs are started as well.

Please see the attached screenshot: I'm using 5 worker processes and some of those workers are blocked in different jobs. This leads to a deadlock because those partially started jobs get stuck and are then killed after 2 hours.

Updated by pcervinka over 7 years ago

We face a similar situation in the HPC group:
https://openqa.suse.de/tests/overview?distri=sle&version=15&build=473.4&groupid=130

Although the jobs are triggered, the information about their relation is sometimes lost and the tests wait for each other in a deadlock.

Updated by sebchlad over 7 years ago

I was updated by Coolo regarding this problem (which also affects HPC after pcervinka's nice improvements to HPC testing) and I understand this is rather significant work to be done, so we shall not expect any quick solution.

I just wonder however about what Coolo said and what Martin nicely described: "I talked to Ettore and he proposed to start cluster test only when there are enough free workers."

Isn't that a quick workaround which we could have before a proper solution?
I understand/guess this would impact time exception but still it might be OK.

Updated by coolo over 7 years ago

Sure, we'll review your patches then.

Updated by sebchlad over 7 years ago

yeah I was actually wondering about this... I would perhaps answer the same way :-)

Updated by EDiGiacinto over 7 years ago

sebchlad wrote:

> I was updated by Coolo regarding this problem (which also affects HPC after pcervinka's nice improvements to HPC testing) and I understand this is rather significant work to be done, so we shall not expect any quick solution.
>
> I just wonder however about what Coolo said and what Martin nicely described: "I talked to Ettore and he proposed to start cluster test only when there are enough free workers."
>
> Isn't that quick workaround which we could have before a proper solution?

That workaround is the only solution I see other than migrating the whole scheduling to AMQP or starting to talk about SAT solvers, which would be even more painful.

> I understand/guess this would impact time exception but still it might be OK.

What will take most of the time is delivering this feature with a guarantee that it does not impact other jobs.

#10

Updated by oholecek over 7 years ago

We (nadvornik and I) took a brief look into this and think that passing $allocating (from https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Scheduler.pm#L242) to job_grab and then to _prefer_parallel as '$running' should avoid scheduling more parallel job groups when there are not enough workers for all parallel groups. What do you think?

Btw. what would migrating the whole scheduling to AMQP solve?

#11

Updated by EDiGiacinto over 7 years ago

Status changed from New to In Progress
Assignee set to EDiGiacinto

#12

Updated by EDiGiacinto over 7 years ago

oholecek wrote:

> We (me and nadvornik) took a brief look into this and think that passing $allocating ( from https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Scheduler.pm#L242 ) to job_grab and then to _prefer_parallel as '$running' should avoid scheduling of more parallel job groups when there are not enough workers for all parallel groups. What do you think?

It might just work, but I would also add a further check at the end of the allocation round.

> Btw. what would migrating the whole scheduling to AMQP solve?

For example, it would avoid using websockets to send jobs, let us treat worker_classes like queues and, more importantly, push us to formalize the problem along the way.

#13

Updated by EDiGiacinto over 7 years ago

EDiGiacinto wrote:

> oholecek wrote:
>
>> We (me and nadvornik) took a brief look into this and think that passing $allocating ( from https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Scheduler/Scheduler.pm#L242 ) to job_grab and then to _prefer_parallel as '$running' should avoid scheduling of more parallel job groups when there are not enough workers for all parallel groups. What do you think?
>
> It might just work, but i would also add a further check at the end of the allocation round

Small update: this wasn't the only thing needed. I'm currently testing the required changes on staging.

To guarantee (in a best-effort fashion) that cluster jobs are assigned (scheduled -> assigned) in parallel, a new variable will be introduced.

>> Btw. what would migrating the whole scheduling to AMQP solve?
>
> e.g. avoid using websockets to send jobs, treating worker_classes like queues and more important trying to formalize the problem during the process

#14

Updated by ldevulder over 7 years ago

pcervinka wrote:

> We face similar situation in HPC group:

Same for HA tests sometimes...

#15

Updated by coolo over 7 years ago

Why would you need a new variable? Why isn't PARALLEL_WITH good enough?

#16

Updated by EDiGiacinto over 7 years ago

Because forcing all cluster jobs to start only in parallel by default is asking for starvation and deadlocks. If we want to go down that road, just say so and this change will be tied to PARALLEL_WITH, without other variables.

But IMO other problems will come up later, and we won't be able to fix them by just adding or removing more filtering. So I would personally go step by step: this behavior should eventually become the default once we have approved it and decided it is good enough, not the other way around.

On a side note:
Even if we count the allocating jobs as running, _prefer_parallel is 'cutting' and not prioritizing as the name may suggest, so all scheduled jobs will be more subject to starvation until we change paradigm (again) and stop relying on database queries for scheduling (or until we scramble the queries once more, but be my guest then). So I see no 'real' solution given the current limitations, just changes stacked on top.
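
The difference between 'cutting' and prioritizing can be shown on a toy candidate list. This is an illustrative sketch only and does not reproduce the real _prefer_parallel; the functions `cut` and `prioritize`, the `(name, group)` tuples and the job names are hypothetical.

```python
def cut(candidates, running_groups):
    """Filter: if any candidate belongs to a running group, keep only
    those candidates and drop everything else for this round."""
    kept = [job for job in candidates if job[1] in running_groups]
    return kept or candidates


def prioritize(candidates, running_groups):
    """Reorder: candidates of running groups come first (stable sort),
    but every other candidate stays in the list."""
    return sorted(candidates, key=lambda job: job[1] not in running_groups)


candidates = [("X", 9), ("C2", 2), ("Y", 8), ("D2", 2)]
running_groups = {2}
print(cut(candidates, running_groups))
print(prioritize(candidates, running_groups))
```

With four free workers, the filtered list hands out only C2 and D2 in this round and leaves X and Y waiting, while the reordered list can still fill the remaining two workers.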

#17