action #158185: parallel job failed to get the vars from its pair size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #158185

closed

parallel job failed to get the vars from its pair size:S

Added by Julie_CAO 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Support
Target version: openQA Project (public) - Ready
Start date: 2024-03-28


Description

Observation

We have a parallel job that failed to get the vars from its pair. A rerun failed as well. Is there something wrong with the worker service?

sub get_var_from_parent {
    my ($self, $var) = @_;
    my $parents = get_parents();
    #Query every parent to find the var
    for my $job_id (@$parents) {
        my $ref = get_job_autoinst_vars($job_id);
        return $ref->{$var} if defined $ref->{$var};
    }
    return;
}

https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt

[2024-03-27T15:39:25.691962Z] [debug] [pid:4639] get_job_autoinst_vars: Connection error: Can't connect: Name or service not known; URL was http://worker35:20493/wS5wkxkWNNB9LK92/vars

Updated by okurz 9 months ago

Category set to Regressions/Crashes
Priority changed from Normal to Urgent
Target version set to Ready

Updated by ybonatakis 9 months ago

Shouldn't the WORKER_CLASS include a tap device?
Anyhow, the WORKER_CLASS differs from that of the last successful job:

"WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-large-mem-intel,64bit-ipmi-ipxe,scooter,region-nue,datacenter-frankencampus,location-nue2,imagetester,cpu-x86_64,cpu-x86_64-v2", vs
"WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,fozzie,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",

Updated by livdywan 9 months ago

Subject changed from parallel job failed to get the vars from its pair to parallel job failed to get the vars from its pair size:S
Description updated (diff)
Category changed from Regressions/Crashes to Support
Status changed from New to Workable

Updated by mkittler 9 months ago

Status changed from Workable to In Progress
Assignee set to mkittler

Updated by mkittler 9 months ago

I think those two conditions lead to the observed problem:

- Those jobs were executed across imagetester.qe.nue2.suse.org and worker35.oqa.prg2.suse.org. Those two workers are under different domains, which makes sense as they are in completely different locations.
- The get_job_autoinst_vars function does not use the worker's FQDN (even though the FQDN is known to openQA via the WORKER_HOSTNAME worker property).

The problem is that worker35 cannot be resolved from imagetester.qe.nue2.suse.org. This would only be possible from workers under the same domain, like worker34.oqa.prg2.suse.org. Normally we only run MM jobs across workers in the same location, so not using the FQDN is normally not a problem (and get_job_autoinst_vars is probably rarely used anyway).
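To make the difference concrete, here is a minimal sketch of how the URL from the log changes depending on whether the short host name or the FQDN is used. It is not the actual os-autoinst code: the helpers `vars_url` and `pick_host` are hypothetical, and the host names, port and token are simply the ones that appear in this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper (not from os-autoinst): build the URL of a job's
# vars endpoint from a worker host, a port and a URL token.
sub vars_url {
    my ($host, $port, $token) = @_;
    return "http://${host}:${port}/${token}/vars";
}

# Hypothetical helper: prefer the FQDN (e.g. the WORKER_HOSTNAME worker
# property) when it is known, otherwise fall back to the short name.
sub pick_host {
    my ($short_name, $fqdn) = @_;
    return (defined $fqdn && length $fqdn) ? $fqdn : $short_name;
}

my $short = 'worker35';
my $fqdn  = 'worker35.oqa.prg2.suse.org';

# Short name only: resolvable only from hosts under the same domain.
print vars_url(pick_host($short, undef), 20493, 'wS5wkxkWNNB9LK92'), "\n";

# FQDN available: resolvable from other domains as well.
print vars_url(pick_host($short, $fqdn), 20493, 'wS5wkxkWNNB9LK92'), "\n";
```

The first URL is the one seen in the failing log; the second is what a lookup based on the FQDN would request instead.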

The involved jobs both have just virt-mm-64bit-ipmi as their worker class, so they are allowed to run on non-tap machines in any location. The openQA scheduler therefore didn't screw things up.

We can solve this by improving at least one of the mentioned conditions:

1. Specify a worker class that is only present on workers in a certain location to avoid running those jobs across different locations. You may also use the newly added dependency pinning feature by adding PARALLEL_ONE_HOST_ONLY=1 next to your PARALLEL_WITH=… settings. (This feature is still subject to change as mentioned in the linked documentation. However, you're welcome to give it a try as we need feedback on that feature anyway.)
2. Make get_job_autoinst_vars use the FQDN. This requires a code change in os-autoinst and openQA. I'll have a look at how complicated it would be.

Considering that you might run into connectivity problems or performance limitations when running those jobs across workers in different locations, I suggest you look into option 1. Meanwhile, I will nevertheless check how much sense it would make for us to implement option 2.
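As a rough sketch, the first option could look like this in the job settings. The value of PARALLEL_WITH is left elided as in the comment above. Using location-prg2 as the location-specific class is an assumption based on the WORKER_CLASS values quoted earlier in this ticket, and it relies on a job only being assigned to workers that provide all of the listed classes.

```
# Variant A: restrict the jobs to workers in one location (class name is an assumption)
WORKER_CLASS=virt-mm-64bit-ipmi,location-prg2

# Variant B: dependency pinning (feature still subject to change)
PARALLEL_WITH=…
PARALLEL_ONE_HOST_ONLY=1
```

Either variant would need to be set consistently on all jobs of the parallel cluster.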

Updated by mkittler 9 months ago

Status changed from In Progress to Feedback

This PR will fix get_job_autoinst_vars for this case: https://github.com/os-autoinst/os-autoinst/pull/2478

However, I suggest you still look into avoiding cross-location MM jobs.

Updated by mkittler 9 months ago

Priority changed from Urgent to Normal

Updated by Julie_CAO 9 months ago

Thank you for the quick fix!

> However, I suggest you still look into avoiding cross-location MM jobs.

We now have 3 MM machines, and one of them was just relocated to PRG2e. It's difficult for us to avoid using it in the short term, but in the long run we really need to consider this.

Updated by Julie_CAO 9 months ago

Status changed from Feedback to Resolved

Verification passed. Thanks again. https://openqa.suse.de/tests/13919095
