action #158185
closed
parallel job failed to get the vars from its pair size:S
Added by Julie_CAO 9 months ago.
Updated 9 months ago.
Description
Observation
We have a parallel job that failed to get the vars from its pair. A rerun failed as well. Is there something wrong with the worker service?
sub get_var_from_parent {
    my ($self, $var) = @_;
    my $parents = get_parents();
    # Query every parent to find the var
    for my $job_id (@$parents) {
        my $ref = get_job_autoinst_vars($job_id);
        return $ref->{$var} if defined $ref->{$var};
    }
    return;
}
https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt
[2024-03-27T15:39:25.691962Z] [debug] [pid:4639] get_job_autoinst_vars: Connection error: Can't connect: Name or service not known; URL was http://worker35:20493/wS5wkxkWNNB9LK92/vars
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
Shouldn't the WORKER_CLASS include the tap device?
Anyhow, the WORKER_CLASS differs from that of the last successful job:
- "WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-large-mem-intel,64bit-ipmi-ipxe,scooter,region-nue,datacenter-frankencampus,location-nue2,imagetester,cpu-x86_64,cpu-x86_64-v2",
vs
- "WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,fozzie,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",
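To make the difference between the two values easier to see, here is a minimal sketch that splits two comma-separated WORKER_CLASS strings and lists the classes found in only one of them. The function name diff_worker_classes is made up for illustration and is not part of openQA; the example strings are shortened versions of the two values quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper (not an openQA function): returns two array refs,
# the classes found only in the first string and only in the second.
sub diff_worker_classes {
    my ($first, $second) = @_;
    my %in_first  = map { $_ => 1 } split /,/, $first;
    my %in_second = map { $_ => 1 } split /,/, $second;
    my @only_first  = sort grep { !$in_second{$_} } keys %in_first;
    my @only_second = sort grep { !$in_first{$_} } keys %in_second;
    return (\@only_first, \@only_second);
}

# Shortened versions of the two WORKER_CLASS values from this ticket
my ($only_first, $only_second) = diff_worker_classes(
    'virt-mm-64bit-ipmi,64bit-ipmi,region-nue,location-nue2,imagetester',
    'virt-mm-64bit-ipmi,64bit-ipmi,region-prg,location-prg2,worker35',
);
print "only in first:  @$only_first\n";
print "only in second: @$only_second\n";
```

With the shortened inputs, only the region, location and host-specific classes show up as differences, which already hints that the two jobs ran at different sites.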
- Subject changed from parallel job failed to get the vars from its pair to parallel job failed to get the vars from its pair size:S
- Description updated (diff)
- Category changed from Regressions/Crashes to Support
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to mkittler
I think those two conditions lead to the observed problem:
- Those jobs were executed across imagetester.qe.nue2.suse.org and worker35.oqa.prg2.suse.org. Those two workers are under different domains, which makes sense as they are in completely different locations.
- The get_job_autoinst_vars function is not using the worker's FQDN (even though the FQDN is known to openQA via the WORKER_HOSTNAME worker property).
The problem is that worker35 cannot be resolved from imagetester.qe.nue2.suse.org. This would only be possible from workers under the same domain, such as worker34.oqa.prg2.suse.org. Normally we only run MM jobs across workers in the same location, so not using the FQDN is usually not a problem (and get_job_autoinst_vars is probably rarely used anyway).
Both involved jobs have just virt-mm-64bit-ipmi as worker class, so they are allowed to run on non-tap machines in any location. So the openQA scheduler didn't screw things up.
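The resolution problem can be illustrated with a small sketch. It assumes, as a simplification, that a host can resolve a short name such as worker35 only when both machines share the same DNS domain (i.e. the resolver's search domain equals the host's own domain). The helper names domain_of and short_name_resolvable_from are hypothetical and exist neither in openQA nor in os-autoinst.

```perl
use strict;
use warnings;

# Everything after the first label of an FQDN, e.g. "oqa.prg2.suse.org".
# Returns an empty string for a short name without any dot.
sub domain_of {
    my ($host) = @_;
    my (undef, $domain) = split /\./, $host, 2;
    return $domain // '';
}

# Simplified model: the short name of $server_fqdn is resolvable from
# $client_fqdn only if both hosts are under the same domain.
sub short_name_resolvable_from {
    my ($client_fqdn, $server_fqdn) = @_;
    my $client_domain = domain_of($client_fqdn);
    return $client_domain ne '' && $client_domain eq domain_of($server_fqdn);
}

for my $client ('imagetester.qe.nue2.suse.org', 'worker34.oqa.prg2.suse.org') {
    my $ok = short_name_resolvable_from($client, 'worker35.oqa.prg2.suse.org');
    printf "%s -> worker35: %s\n", $client, $ok ? 'resolvable' : 'not resolvable';
}
```

Under this model imagetester cannot reach worker35 by its short name, while worker34 can, which matches the connection error in the log.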
We can solve this by improving at least one of the mentioned conditions:
- Specify a worker class that is only present on workers in a certain location to avoid running those jobs across different locations. You may also use the newly added dependency pinning feature by adding PARALLEL_ONE_HOST_ONLY=1 next to your PARALLEL_WITH=… settings. (This feature is still subject to change as mentioned in the linked documentation. However, you're welcome to give it a try as we need feedback on that feature anyway.)
- Make get_job_autoinst_vars use the FQDN. This requires a code change in os-autoinst and openQA. I'll have a look at how complicated it would be.
Considering that you might run into connectivity problems or performance limitations when running those jobs across workers in different locations, I suggest you look into the first option. Meanwhile, I will nevertheless check how much sense it would make for us to implement the second one.
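As a rough illustration of the FQDN-based option, the sketch below builds the vars URL in the shape seen in the log (http://HOST:PORT/TOKEN/vars) and prefers a fully qualified name when one is known. This is not the actual change in os-autoinst or openQA: the function vars_url and the hash keys host, fqdn and port are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: $worker is a hash ref with the keys "host" (short
# name), "port" and optionally "fqdn" (e.g. taken from WORKER_HOSTNAME).
sub vars_url {
    my ($worker, $token) = @_;
    # Prefer the FQDN so the name also resolves from other domains
    my $host = $worker->{fqdn} || $worker->{host};
    return sprintf 'http://%s:%d/%s/vars', $host, $worker->{port}, $token;
}

my $token = 'wS5wkxkWNNB9LK92';    # token as it appears in the log above
print vars_url({host => 'worker35', port => 20493}, $token), "\n";
print vars_url({host => 'worker35', fqdn => 'worker35.oqa.prg2.suse.org', port => 20493}, $token), "\n";
```

The first call reproduces the URL from the failing log line; the second shows the same request addressed to the FQDN, which does not depend on both workers sharing a domain.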
- Status changed from In Progress to Feedback
- Priority changed from Urgent to Normal
Thank you for the quick fix!
However, I suggest you still look into avoiding cross-location MM jobs.
We currently have 3 MM machines, and one of them was just relocated to PRG2e. It's difficult for us to avoid using it in the short term, but in the long run we really need to consider this.
- Status changed from Feedback to Resolved