action #40415

Concurrent jobs with dependencies don't work if they are on different machines.

Added by jlausuch over 1 year ago. Updated about 1 year ago.

Status: Resolved
Start date: 29/08/2018
Priority: High
Due date:
Assignee: mitiao
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

Reproducibility:

We have 2 jobs, let's say PARENT and CHILD, where CHILD has PARALLEL_WITH=PARENT.

I created two tests that both run on the same machine, "64bit" (qemu with some other options):
http://fromm.arch.suse.de/tests/1394
http://fromm.arch.suse.de/tests/1395

The parent job needs the child ID for the mutex command:

# Look up the child jobs; the hash is keyed by job ID.
my $children = get_children();
my $child_id = (keys %$children)[0];
...
script_run("echo Waiting for child with child_id=$child_id");
# Block on the child's "child_ready" mutex.
mutex_wait("child_ready", $child_id);

This is one line of the parent's output:

Waiting for child with child_id=1399

Everything is OK so far: CHILD recognizes PARENT as its parent, and the locking API works without problems.

Then I created another machine, "64bit-other", with exactly the same characteristics as the first one (http://fromm.arch.suse.de/admin/machines), and assigned CHILD to "64bit-other" in the job group.

The result is that CHILD doesn't have the parent job in the settings panel any more, and the PARENT's output is now:

Waiting for child with child_id=

Therefore, the command

mutex_wait("child_ready", $child_id);

waits forever.

Why have different machines? For virtual jobs it doesn't make sense, but for BareMetal jobs such as the NFV and InfiniBand tests we use different workers and machines:
ipmi-sonic and ipmi-tails, with the worker classes 64bit-mlx_con5_sonic and 64bit-mlx_con5_tails respectively.


Related issues

Related to openQA Project - action #25892: Scheduling parallel jobs Resolved 10/10/2017
Related to openQA Tests - action #42857: [functional][u][s390x] Change structure of s390x KVM host... Workable 24/10/2018

History

#1 Updated by EDiGiacinto over 1 year ago

  • Category set to 122

That's a feature that should also consider adapting what was done for https://progress.opensuse.org/issues/25892

#2 Updated by EDiGiacinto over 1 year ago

#3 Updated by coolo over 1 year ago

  • Target version set to Current Sprint

#4 Updated by coolo over 1 year ago

The tricky part is finding the limits - e.g. if you schedule in one job group multiple server/client pairs on different hardware/architecture. So we'll need some kind of 'finding nearest partner' and error out if we can't clearly identify it.
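
One possible reading of the "finding nearest partner" idea, as a minimal Python sketch rather than what openQA actually does: candidates with the requested test name are narrowed first to the same machine, then to the same architecture, then to any job, and the lookup errors out as soon as a tier holds more than one match. The `Job` fields, the tier order, and the names `find_partner` and `AmbiguousPartner` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical, simplified view of a scheduled job.
    test: str
    machine: str
    arch: str


class AmbiguousPartner(Exception):
    """Raised when more than one equally near partner exists."""


def find_partner(child, parent_test, jobs):
    """Return the nearest job named parent_test, or raise if unclear."""
    candidates = [j for j in jobs if j.test == parent_test and j is not child]
    # Assumed tier order: same machine, then same architecture, then anything.
    tiers = (
        lambda j: j.machine == child.machine and j.arch == child.arch,
        lambda j: j.arch == child.arch,
        lambda j: True,
    )
    for matches in tiers:
        tier = [j for j in candidates if matches(j)]
        if len(tier) == 1:
            return tier[0]
        if len(tier) > 1:
            raise AmbiguousPartner(
                f"{len(tier)} candidates for {parent_test!r}: {tier}"
            )
    raise LookupError(f"no job named {parent_test!r} in this group")
```

With one PARENT on "64bit" and CHILD on "64bit-other", the second tier finds the single partner. Adding a second PARENT on another machine of the same architecture makes the lookup raise instead of guessing.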

#5 Updated by mitiao over 1 year ago

  • Assignee set to mitiao

#6 Updated by mitiao over 1 year ago

  • Status changed from New to In Progress

#8 Updated by cfconrad over 1 year ago

What about explicitly defining the machine, like:

START_AFTER_TEST=upload_img:64bit

or

PARALLEL_WITH=test1:%MACHINE%-foo,test2:%MACHINE%-baar
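
A rough Python sketch of how such a setting value could be split into (test, machine) pairs, shown only to make the proposal concrete. The function name `parse_dependencies`, the fallback to the dependent job's own machine when no suffix is given, and the treatment of `%MACHINE%` as a placeholder for that machine are assumptions, not openQA's implementation.

```python
def parse_dependencies(value, own_machine):
    """Split 'test1:machineA,test2' into [(test, machine), ...].

    Assumptions: entries are comma-separated, an optional ':machine'
    suffix names the partner's machine, and '%MACHINE%' expands to the
    dependent job's own machine.
    """
    deps = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        test, sep, machine = entry.partition(":")
        if sep and not machine:
            raise ValueError(f"empty machine name in {entry!r}")
        if not sep:
            # No explicit machine: fall back to the job's own machine.
            machine = own_machine
        deps.append((test, machine.replace("%MACHINE%", own_machine)))
    return deps
```

For example, `parse_dependencies("test1:%MACHINE%-foo,test2:%MACHINE%-baar", "64bit")` would give `[("test1", "64bit-foo"), ("test2", "64bit-baar")]` under these assumptions.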

#9 Updated by coolo over 1 year ago

This might work for you, as you only have one architecture. But in every other scenario it means duplicating test suites, because you need to hardcode machine names in the test suite settings.

#13 Updated by mitiao over 1 year ago

  • Status changed from In Progress to Resolved

Resolved as PR merged.

#14 Updated by okurz over 1 year ago

  • Status changed from Resolved to In Progress

Please check http://open.qa/docs/#_inter_machine_dependencies, the documentation seems to be broken on "===== Example", maybe just a missing blank line?

#15 Updated by mitiao over 1 year ago

  • Status changed from In Progress to Feedback

okurz wrote:

Please check http://open.qa/docs/#_inter_machine_dependencies, the documentation seems to be broken on "===== Example", maybe just a missing blank line?

Thanks for checking; fixed in
https://github.com/os-autoinst/openQA/pull/1859

#16 Updated by okurz over 1 year ago

  • Status changed from Feedback to Resolved

#17 Updated by mitiao over 1 year ago

Sorry, another fix for the doc
https://github.com/os-autoinst/openQA/pull/1865
Should be fine finally.

#18 Updated by okurz over 1 year ago

  • Related to action #42857: [functional][u][s390x] Change structure of s390x KVM hosts on production (o.s.d) added

#19 Updated by coolo about 1 year ago

  • Target version changed from Current Sprint to Done
