action #158185: parallel job failed to get the vars from its pair size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

Added by Julie_CAO 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Support
Target version: openQA Project (public) - Ready
Start date: 2024-03-28
Description

Observation

We have a parallel job that failed to get the vars from its pair. A rerun failed as well. Is there something wrong with the worker service?

sub get_var_from_parent {
    my ($self, $var) = @_;
    my $parents = get_parents();
    #Query every parent to find the var
    for my $job_id (@$parents) {
        my $ref = get_job_autoinst_vars($job_id);
        return $ref->{$var} if defined $ref->{$var};
    }
    return;
}

https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt

[2024-03-27T15:39:25.691962Z] [debug] [pid:4639] get_job_autoinst_vars: Connection error: Can't connect: Name or service not known; URL was http://worker35:20493/wS5wkxkWNNB9LK92/vars

Updated by okurz 9 months ago

Category set to Regressions/Crashes
Priority changed from Normal to Urgent
Target version set to Ready

Updated by ybonatakis 9 months ago

Shouldn't the WORKER_CLASS include a tap device?
In any case, the WORKER_CLASS differs from that of the last successful job:

"WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-large-mem-intel,64bit-ipmi-ipxe,scooter,region-nue,datacenter-frankencampus,location-nue2,imagetester,cpu-x86_64,cpu-x86_64-v2", vs
"WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,fozzie,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",

Updated by livdywan 9 months ago

Subject changed from "parallel job failed to get the vars from its pair" to "parallel job failed to get the vars from its pair size:S"
Description updated (diff)
Category changed from Regressions/Crashes to Support
Status changed from New to Workable

Updated by mkittler 9 months ago

Status changed from Workable to In Progress
Assignee set to mkittler

Updated by mkittler 9 months ago

I think those two conditions lead to the observed problem:

1. The jobs were executed across imagetester.qe.nue2.suse.org and worker35.oqa.prg2.suse.org. These two workers are in different domains, which makes sense as they are in completely different locations.
2. The get_job_autoinst_vars function does not use the worker's FQDN (even though the FQDN is known to openQA via the WORKER_HOSTNAME worker property).

The problem is that the short name worker35 cannot be resolved from imagetester.qe.nue2.suse.org. Resolving it is only possible from workers in the same domain, such as worker34.oqa.prg2.suse.org. Normally we only run MM jobs across workers in the same location, so not using the FQDN is usually not a problem (and get_job_autoinst_vars is probably rarely used anyway).

Both involved jobs have just virt-mm-64bit-ipmi as their worker class, so they are allowed to run on non-tap machines in any location. So the openQA scheduler didn't screw things up.

We can solve this by improving at least one of the mentioned conditions:

1. Specify a worker class that is only present on workers in a certain location to avoid running those jobs across different locations. You may also use the newly added dependency pinning feature by adding PARALLEL_ONE_HOST_ONLY=1 next to your PARALLEL_WITH=… settings. (This feature is still subject to change as mentioned in the linked documentation. However, you're welcome to give it a try as we need feedback on that feature anyway.)
2. Make get_job_autoinst_vars use the FQDN. This requires a code change in os-autoinst and openQA. I'll have a look at how complicated it would be.

Considering that you might run into connectivity problems or performance limitations when running those jobs across workers in different locations, I suggest you look into option 1. Meanwhile, I will nevertheless check how much sense it would make for us to implement option 2.
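For illustration, the first option could look roughly like the job settings below. This is only a sketch: the PARALLEL_WITH value is left elided as in the comment above, and combining virt-mm-64bit-ipmi with region-prg is an assumption based on the worker classes quoted earlier in this ticket, not a tested configuration. It only helps if both jobs can find a suitable worker in that region.

```
# Variant A: keep both jobs in one location via a location-specific worker class
# (region-prg is taken from the WORKER_CLASS list of worker35 quoted above)
WORKER_CLASS=virt-mm-64bit-ipmi,region-prg

# Variant B: use the dependency pinning feature (still subject to change)
PARALLEL_WITH=…
PARALLEL_ONE_HOST_ONLY=1
```

As the setting name suggests, Variant B is expected to keep the parallel jobs on one worker host, which avoids cross-host name resolution entirely.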

Updated by mkittler 9 months ago

Status changed from In Progress to Feedback

This PR will fix get_job_autoinst_vars for this case: https://github.com/os-autoinst/os-autoinst/pull/2478

However, I suggest you still look into avoiding cross-location MM jobs.
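To make the idea behind the second option concrete, here is a minimal Perl sketch of preferring the FQDN when building the URL. It is not the code from the PR. The helper name autoinst_base_url, the layout of the worker hash, the instance number and the port formula are all assumptions made for illustration; only the WORKER_HOSTNAME property and the host names come from this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of os-autoinst): build the base URL of the
# command server of a job running on the given worker.
# Assumed input: a hash ref with the short 'host' name, an 'instance' number
# and optional 'properties' that may contain WORKER_HOSTNAME (the FQDN).
sub autoinst_base_url {
    my ($worker, $job_token) = @_;
    # Prefer the FQDN if the worker reported one, else fall back to the short name
    my $host = $worker->{properties}{WORKER_HOSTNAME} // $worker->{host};
    # Port scheme assumed for illustration: 20002 + 10 * instance + 1
    my $port = 20002 + 10 * $worker->{instance} + 1;
    return "http://$host:$port/$job_token";
}

# Example data; instance 49 is a made-up value
my %worker = (
    host       => 'worker35',
    instance   => 49,
    properties => {WORKER_HOSTNAME => 'worker35.oqa.prg2.suse.org'},
);
print autoinst_base_url(\%worker, 'TOKEN'), "/vars\n";
```

With the FQDN in the URL, a worker in another domain such as imagetester.qe.nue2.suse.org no longer depends on the short name worker35 being resolvable locally.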

Updated by mkittler 9 months ago

Priority changed from Urgent to Normal

Updated by Julie_CAO 9 months ago

Thank you for the quick fix!

> However, I suggest you still look into avoiding cross-location MM jobs.

We currently have 3 MM machines, one of which was just relocated to PRG2e. It's difficult for us to avoid using it in the short term, but in the long run we really need to consider this.

Updated by Julie_CAO 9 months ago

Status changed from Feedback to Resolved

Verification passed. Thanks again. https://openqa.suse.de/tests/13919095
