action #158185

closed

parallel job failed to get the vars from its pair size:S

Added by Julie_CAO 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Support
Target version:
Start date:
2024-03-28
Due date:
% Done:

0%

Estimated time:

Description

Observation

We have a parallel job that failed to get the vars from its pair. A rerun failed as well. Is there something wrong with the worker service?

sub get_var_from_parent {
    my ($self, $var) = @_;
    my $parents = get_parents();
    #Query every parent to find the var
    for my $job_id (@$parents) {
        my $ref = get_job_autoinst_vars($job_id);
        return $ref->{$var} if defined $ref->{$var};
    }
    return;
}

https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt

[2024-03-27T15:39:25.691962Z] [debug] [pid:4639] get_job_autoinst_vars: Connection error: Can't connect: Name or service not known; URL was http://worker35:20493/wS5wkxkWNNB9LK92/vars
Actions #1

Updated by okurz 3 months ago

  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by ybonatakis 3 months ago

Shouldn't the WORKER_CLASS include a tap device?
Anyhow, the WORKER_CLASS differs from that of the last successful job:

  • "WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-large-mem-intel,64bit-ipmi-ipxe,scooter,region-nue,datacenter-frankencampus,location-nue2,imagetester,cpu-x86_64,cpu-x86_64-v2", vs
  • "WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,fozzie,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",
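
To make the difference between the two values easier to see, here is a minimal sketch that lists the classes present in only one of them. The helper name `only_in_first` is made up for this illustration and is not part of openQA or os-autoinst; the two strings are copied from the values quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entries of the first comma-separated
# list that do not appear in the second one, keeping their order.
sub only_in_first {
    my ($first, $second) = @_;
    my %in_second = map { $_ => 1 } split /,/, $second;
    return [grep { !$in_second{$_} } split /,/, $first];
}

# The two WORKER_CLASS values as quoted in this comment.
my $class_a = 'virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-large-mem-intel,64bit-ipmi-ipxe,scooter,region-nue,datacenter-frankencampus,location-nue2,imagetester,cpu-x86_64,cpu-x86_64-v2';
my $class_b = 'virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,fozzie,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3';

my $only_a = only_in_first($class_a, $class_b);
my $only_b = only_in_first($class_b, $class_a);

print 'only in first:  ', join(', ', @$only_a), "\n";
print 'only in second: ', join(', ', @$only_b), "\n";
```

Among the entries that differ are the region, datacenter and location classes, so the two workers sit in different locations.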
Actions #3

Updated by livdywan 3 months ago

  • Subject changed from parallel job failed to get the vars from its pair to parallel job failed to get the vars from its pair size:S
  • Description updated (diff)
  • Category changed from Regressions/Crashes to Support
  • Status changed from New to Workable
Actions #4

Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #5

Updated by mkittler 3 months ago

I think these two conditions led to the observed problem:

  1. The jobs were executed across imagetester.qe.nue2.suse.org and worker35.oqa.prg2.suse.org. These two workers are under different domains, which makes sense as they are in completely different locations.
  2. The get_job_autoinst_vars function is not using the worker's FQDN (even though the FQDN is known to openQA via the WORKER_HOSTNAME worker property).

The problem is that the short name worker35 cannot be resolved from imagetester.qe.nue2.suse.org. This would only be possible from workers under the same domain, like worker34.oqa.prg2.suse.org. We normally only run MM jobs across workers in the same location, so not using the FQDN is usually not a problem (and get_job_autoinst_vars is probably rarely used anyway).

Both involved jobs have just virt-mm-64bit-ipmi as their worker class, so they are allowed to run on non-tap machines in any location. The openQA scheduler therefore did nothing wrong.

We can solve this by improving at least one of the mentioned conditions:

  1. Specify a worker class that is only present on workers in a certain location to avoid running those jobs across different locations. You may also use the newly added dependency pinning feature by adding PARALLEL_ONE_HOST_ONLY=1 next to your PARALLEL_WITH=… settings. (This feature is still subject to change as mentioned in the linked documentation. However, you're welcome to give it a try as we need feedback on that feature anyway.)
  2. Make get_job_autoinst_vars use the FQDN. This requires a code change in os-autoinst and openQA. I'll have a look at how complicated it would be.

Considering that you might run into connectivity problems or performance limitations when running those jobs across workers in different locations, I suggest you look into 1. Meanwhile, I will nevertheless see how much sense it would make for us to implement 2.
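
To illustrate option 2, here is a minimal sketch of how the URL for fetching the vars could prefer the FQDN over the short host name. It is not the actual os-autoinst or openQA code: the helper `build_vars_url` and the hash keys `host`, `port` and `token` are assumptions made for this example. Only the URL shape (taken from the log line in the description) and the WORKER_HOSTNAME property come from this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper, not an os-autoinst function: build the URL under
# which a worker serves a job's vars. $worker is an illustrative hash ref.
sub build_vars_url {
    my ($worker) = @_;
    # Prefer the FQDN from the WORKER_HOSTNAME property when it is set;
    # otherwise fall back to the short host name, as in the failing case.
    my $host = $worker->{WORKER_HOSTNAME} // $worker->{host};
    return sprintf 'http://%s:%d/%s/vars', $host, $worker->{port}, $worker->{token};
}

# Short name only: resolvable only from workers under the same domain.
print build_vars_url({host => 'worker35', port => 20493, token => 'wS5wkxkWNNB9LK92'}), "\n";

# With the FQDN: resolvable from other domains as well.
print build_vars_url({
    host            => 'worker35',
    WORKER_HOSTNAME => 'worker35.oqa.prg2.suse.org',
    port            => 20493,
    token           => 'wS5wkxkWNNB9LK92',
}), "\n";
```

The fallback to the short name keeps the behaviour unchanged for workers that do not expose an FQDN.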

Actions #6

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

This PR will fix get_job_autoinst_vars for this case: https://github.com/os-autoinst/os-autoinst/pull/2478

However, I suggest you still look into avoiding cross-location MM jobs.

Actions #7

Updated by mkittler 3 months ago

  • Priority changed from Urgent to Normal
Actions #8

Updated by Julie_CAO 3 months ago

Thank you for the quick fix!

> However, I suggest you still look into avoiding cross-location MM jobs.

We currently have 3 MM machines, one of which was just relocated to PRG2e. It is difficult for us to avoid using it in the short term, but in the long run we really need to consider this.

Actions #9

Updated by Julie_CAO 3 months ago

  • Status changed from Feedback to Resolved

Verification passed. Thanks again. https://openqa.suse.de/tests/13919095
