action #40415

Concurrent jobs with dependencies don't work if they are on different machines.

Added by jlausuch about 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-08-29
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Reproducibility:

We have 2 jobs, let's say PARENT and CHILD, where CHILD has PARALLEL_WITH=PARENT.

I created two tests that run on the same machine, "64bit" (qemu with some other options):
http://fromm.arch.suse.de/tests/1394
http://fromm.arch.suse.de/tests/1395

The parent job needs the child ID for the mutex command:

my $children       = get_children();
my $child_id       = (keys %$children)[0];
...
script_run("echo Waiting for child with child_id=$child_id");
mutex_wait("child_ready", $child_id);

This is one line of the parent's output:

Waiting for child with child_id=1399

Everything is OK so far: CHILD recognizes PARENT as its parent and the locking API works without problems.

Then I created another machine, "64bit-other", with exactly the same characteristics as the first one (http://fromm.arch.suse.de/admin/machines), and assigned CHILD to "64bit-other" in the job group.

The result is that CHILD doesn't have the parent job in the settings panel any more, and the PARENT's output is now:

Waiting for child with child_id=

Therefore, the command

mutex_wait("child_ready", $child_id);

waits forever.
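
One way to avoid the endless wait would be for the parent to fail fast when no child is registered. The following is only a minimal sketch under assumptions, not part of the reported test: `first_child_id` is a hypothetical helper, and the hash literals stand in for what `get_children()` returned in the two runs described above. Unlike the original snippet, it sorts the IDs so the choice is deterministic.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowest child job ID, or die if the
# hash of children is empty (as observed when CHILD runs on "64bit-other").
sub first_child_id {
    my ($children) = @_;
    my @ids = sort { $a <=> $b } keys %{ $children // {} };
    die "No child job found; refusing to wait on a mutex without an owner\n"
        unless @ids;
    return $ids[0];
}

# Stand-ins for get_children() in the two scenarios described above.
# The hash values are placeholders; only the keys (job IDs) matter here.
my %same_machine  = (1399 => 'running');
my %other_machine = ();

print "child_id=", first_child_id(\%same_machine), "\n";

# With no children the helper dies instead of returning an empty ID.
my $ok = eval { first_child_id(\%other_machine); 1 };
print "error: $@" unless $ok;
```

In the parent test, the call would presumably look like `my $child_id = first_child_id(get_children());` placed before `mutex_wait("child_ready", $child_id);`, so the job fails with a clear message rather than hanging.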

Why have different machines? For virtual jobs it doesn't make sense, but for bare-metal jobs such as the NFV and InfiniBand tests we use different workers and machines:
ipmi-sonic and ipmi-tails, with the worker classes 64bit-mlx_con5_sonic and 64bit-mlx_con5_tails respectively.


Related issues

Related to openQA Project - action #25892: Scheduling parallel jobs (Resolved, 2017-10-10)

Related to openQA Tests - action #42857: [qe-core][functional][s390x] Change structure of s390x KVM hosts on production (o.s.d) (Resolved, 2018-10-24)

History

#1 Updated by EDiGiacinto about 3 years ago

  • Category set to 122

That's a feature, and implementing it should also consider adapting what was done for https://progress.opensuse.org/issues/25892

#2 Updated by EDiGiacinto about 3 years ago

#3 Updated by coolo about 3 years ago

  • Target version set to Current Sprint

#4 Updated by coolo about 3 years ago

The tricky part is finding the limits - e.g. if you schedule in one job group multiple server/client pairs on different hardware/architecture. So we'll need some kind of 'finding nearest partner' and error out if we can't clearly identify it.
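
The "finding nearest partner" idea could be sketched roughly as follows. This is an illustration under stated assumptions, not openQA's actual scheduling rule: `find_partner` is a hypothetical function, the preference order (same machine first, then same architecture) is assumed, and the job list, including the "aarch64" machine name, is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch: among the jobs of a group, pick the job that runs
# $wanted_test and is "nearest" to $job. Assumed preference order:
#   1. same machine, 2. same architecture; otherwise error out.
sub find_partner {
    my ($job, $wanted_test, $jobs) = @_;
    my @candidates = grep { $_->{test} eq $wanted_test } @$jobs;

    for my $key (qw(machine arch)) {
        my @match = grep { $_->{$key} eq $job->{$key} } @candidates;
        return $match[0] if @match == 1;
        # More than one equally near candidate: the partner is not clear.
        die "Ambiguous partner for $job->{test}: "
            . scalar(@match) . " '$wanted_test' jobs share $key $job->{$key}\n"
            if @match > 1;
    }
    die "No partner '$wanted_test' found for $job->{test}\n";
}

# Made-up job group: two PARENT jobs on different architectures and one
# CHILD on a machine of its own.
my @jobs = (
    { test => 'PARENT', machine => '64bit',       arch => 'x86_64' },
    { test => 'CHILD',  machine => '64bit-other', arch => 'x86_64' },
    { test => 'PARENT', machine => 'aarch64',     arch => 'aarch64' },
);

my $partner = find_partner($jobs[1], 'PARENT', \@jobs);
print "CHILD pairs with PARENT on $partner->{machine}\n";
```

Here CHILD has no PARENT on its own machine, so the sketch falls back to the single PARENT on the same architecture; with two x86_64 PARENT jobs it would die instead of guessing.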

#5 Updated by mitiao almost 3 years ago

  • Assignee set to mitiao

#6 Updated by mitiao almost 3 years ago

  • Status changed from New to In Progress

#8 Updated by cfconrad almost 3 years ago

What about explicitly defining the machine, like:

START_AFTER_TEST=upload_img:64bit

or

PARALLEL_WITH=test1:%MACHINE%-foo,test2:%MACHINE%-baar
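
To make the proposal concrete, here is a minimal sketch of how such a value might be parsed. It covers only the `test:machine` syntax suggested in this comment, which is not necessarily the syntax that was eventually merged; `parse_dependencies` is a hypothetical function, and treating a missing machine part as "same machine as the job" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical parser for the proposed "test:machine" syntax.
# Returns a list of { test => ..., machine => ... } hashes; an entry
# without a machine part falls back to the job's own machine (assumption).
sub parse_dependencies {
    my ($value, $own_machine) = @_;
    my @deps;
    for my $entry (split /\s*,\s*/, $value) {
        my ($test, $machine) = split /:/, $entry, 2;
        $machine //= '%MACHINE%';
        # Expand the %MACHINE% placeholder with the job's own machine name.
        $machine =~ s/%MACHINE%/$own_machine/g;
        push @deps, { test => $test, machine => $machine };
    }
    return @deps;
}

my @deps = parse_dependencies('test1:%MACHINE%-foo,test2:%MACHINE%-baar', '64bit');
print "$_->{test} on $_->{machine}\n" for @deps;
```

For a job on machine "64bit", the value from the example above resolves to test1 on "64bit-foo" and test2 on "64bit-baar".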

#9 Updated by coolo almost 3 years ago

This might work for you, as you only have one architecture. But in every other scenario it means duplicating test suites, because you need to hardcode machine names in the test suite settings.

#13 Updated by mitiao almost 3 years ago

  • Status changed from In Progress to Resolved

Resolved, as the PR was merged.

#14 Updated by okurz almost 3 years ago

  • Status changed from Resolved to In Progress

Please check http://open.qa/docs/#_inter_machine_dependencies, the documentation seems to be broken on "===== Example", maybe just a missing blank line?

#15 Updated by mitiao almost 3 years ago

  • Status changed from In Progress to Feedback

okurz wrote:

Please check http://open.qa/docs/#_inter_machine_dependencies, the documentation seems to be broken on "===== Example", maybe just a missing blank line?

Thanks for checking; fixed in
https://github.com/os-autoinst/openQA/pull/1859

#16 Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

#17 Updated by mitiao almost 3 years ago

Sorry, here is another fix for the docs:
https://github.com/os-autoinst/openQA/pull/1865
It should finally be fine now.

#18 Updated by okurz almost 3 years ago

  • Related to action #42857: [qe-core][functional][s390x] Change structure of s390x KVM hosts on production (o.s.d) added

#19 Updated by coolo almost 3 years ago

  • Target version changed from Current Sprint to Done
