action #40415

closed

Concurrent jobs with dependencies don't work if they are on different machines.

Added by jlausuch over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-08-29
Due date:
% Done: 0%
Estimated time:

Description

Reproducibility:

We have 2 jobs, let's say PARENT and CHILD, where CHILD has PARALLEL_WITH=PARENT.

I created two tests on the same machine, "64bit" (qemu with some other options):
http://fromm.arch.suse.de/tests/1394
http://fromm.arch.suse.de/tests/1395

The parent job needs the child ID for the mutex command:

my $children       = get_children();
my $child_id       = (keys %$children)[0];
...
script_run("echo Waiting for child with child_id=$child_id");
mutex_wait("child_ready", $child_id);

This is one line of the parent's output:

Waiting for child with child_id=1399

Everything is OK so far: CHILD recognizes PARENT as its parent, and the locking API works without problems.

Then I created another machine, "64bit-other", with exactly the same characteristics as the first one (http://fromm.arch.suse.de/admin/machines), and assigned CHILD to "64bit-other" in the job group.

The result is that CHILD no longer shows the parent job in the settings panel, and PARENT's output is now:

Waiting for child with child_id=

Therefore, the command

mutex_wait("child_ready", $child_id);

waits forever.

Why use different machines? For virtual jobs it doesn't make sense, but for bare-metal jobs such as the NFV and InfiniBand tests we use different workers and machines:
ipmi-sonic and ipmi-tails, with the worker classes 64bit-mlx_con5_sonic and 64bit-mlx_con5_tails respectively.


Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #25892: Scheduling parallel jobs (Resolved, EDiGiacinto, 2017-10-10)

Related to openQA Tests - action #42857: [qe-core][functional][s390x] Change structure of s390x KVM hosts on production (o.s.d) (Resolved, 2018-10-24)

Actions #1

Updated by EDiGiacinto over 5 years ago

  • Category set to 122

This is a feature request, and implementing it should also consider adapting what was done for https://progress.opensuse.org/issues/25892

Actions #2

Updated by EDiGiacinto over 5 years ago

Actions #3

Updated by coolo over 5 years ago

  • Target version set to Current Sprint
Actions #4

Updated by coolo over 5 years ago

The tricky part is finding the limits, e.g. when one job group schedules multiple server/client pairs on different hardware or architectures. So we'll need some kind of 'finding nearest partner' logic that errors out if the partner can't be clearly identified.
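
The partner lookup described above could be sketched roughly as follows. This is a hypothetical Python illustration, not openQA's actual implementation: the `Job` tuple, the `find_partner` function, and the rule "prefer a partner on the same machine, otherwise accept a single candidate on the same architecture, otherwise fail" are all assumptions made for the example.

```python
from typing import NamedTuple, Sequence


class Job(NamedTuple):
    """Hypothetical, simplified view of a scheduled job."""
    test: str
    arch: str
    machine: str


class AmbiguousPartner(Exception):
    """Raised when no single partner job can be identified."""


def find_partner(child: Job, parent_test: str, jobs: Sequence[Job]) -> Job:
    # Candidates: jobs running the requested test suite on the same architecture.
    candidates = [j for j in jobs
                  if j.test == parent_test and j.arch == child.arch]
    # Nearest match: exactly one candidate on the child's own machine.
    same_machine = [j for j in candidates if j.machine == child.machine]
    if len(same_machine) == 1:
        return same_machine[0]
    # Otherwise accept a partner on another machine only if it is unique.
    if not same_machine and len(candidates) == 1:
        return candidates[0]
    # Zero or several candidates: error out instead of guessing.
    raise AmbiguousPartner(
        f"cannot identify {parent_test!r} for {child.test!r} "
        f"on {child.machine!r}: {len(candidates)} candidate(s)"
    )
```

With one PARENT on "64bit" and CHILD on "64bit-other", the lookup returns the PARENT job. Adding a second PARENT on a third machine makes the pairing ambiguous, so the sketch raises `AmbiguousPartner` rather than picking one.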

Actions #5

Updated by mitiao over 5 years ago

  • Assignee set to mitiao
Actions #6

Updated by mitiao over 5 years ago

  • Status changed from New to In Progress
Actions #8

Updated by cfconrad over 5 years ago

What about explicitly defining the machine, like:

START_AFTER_TEST=upload_img:64bit

or

PARALLEL_WITH=test1:%MACHINE%-foo,test2:%MACHINE%-baar
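
To make the proposed syntax concrete, here is a minimal Python sketch of how such a value might be parsed. It is only an illustration under stated assumptions: the function name `parse_dependencies` is hypothetical, `%MACHINE%` is assumed to expand to the machine of the job that carries the setting, and an entry without a `:machine` suffix is assumed to mean "same machine as this job".

```python
from typing import List, Tuple


def parse_dependencies(value: str, current_machine: str) -> List[Tuple[str, str]]:
    """Split a 'test[:machine],...' setting into (test, machine) pairs."""
    deps = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty entries such as a trailing comma
        test, _, machine = entry.partition(":")
        if machine:
            # Assumption: %MACHINE% stands for the current job's machine.
            machine = machine.replace("%MACHINE%", current_machine)
        else:
            # Assumption: no suffix means the same machine as this job.
            machine = current_machine
        deps.append((test, machine))
    return deps
```

For a job on machine "64bit", `parse_dependencies("test1:%MACHINE%-foo,test2:%MACHINE%-baar", "64bit")` yields `[("test1", "64bit-foo"), ("test2", "64bit-baar")]`, while a plain `PARENT` resolves to `("PARENT", "64bit")`.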
Actions #9

Updated by coolo over 5 years ago

This might work for you, as you only have one architecture. But in every other scenario it means duplicating test suites, because you need to hardcode machine names in the test suite settings.

Actions #13

Updated by mitiao over 5 years ago

  • Status changed from In Progress to Resolved

Resolved, as the PR has been merged.

Actions #14

Updated by okurz over 5 years ago

  • Status changed from Resolved to In Progress

Please check http://open.qa/docs/#_inter_machine_dependencies, the documentation seems to be broken on "===== Example", maybe just a missing blank line?

Actions #15

Updated by mitiao over 5 years ago

  • Status changed from In Progress to Feedback

okurz wrote:

Please check http://open.qa/docs/#_inter_machine_dependencies, the documentation seems to be broken on "===== Example", maybe just a missing blank line?

Thanks for checking; fixed in
https://github.com/os-autoinst/openQA/pull/1859

Actions #16

Updated by okurz over 5 years ago

  • Status changed from Feedback to Resolved
Actions #17

Updated by mitiao over 5 years ago

Sorry, one more fix for the docs:
https://github.com/os-autoinst/openQA/pull/1865
It should finally be fine now.

Actions #18

Updated by okurz over 5 years ago

  • Related to action #42857: [qe-core][functional][s390x] Change structure of s390x KVM hosts on production (o.s.d) added
Actions #19

Updated by coolo over 5 years ago

  • Target version changed from Current Sprint to Done