action #158185
closed
parallel job failed to get the vars from its pair size:S
Description
Observation
We have a parallel job that failed to get the vars from its pair. A rerun also failed. Is there something wrong with the worker service?
sub get_var_from_parent {
    my ($self, $var) = @_;
    my $parents = get_parents();
    # Query every parent to find the var
    for my $job_id (@$parents) {
        my $ref = get_job_autoinst_vars($job_id);
        return $ref->{$var} if defined $ref->{$var};
    }
    return;
}
https://openqa.suse.de/tests/13885165/logfile?filename=autoinst-log.txt
[2024-03-27T15:39:25.691962Z] [debug] [pid:4639] get_job_autoinst_vars: Connection error: Can't connect: Name or service not known; URL was http://worker35:20493/wS5wkxkWNNB9LK92/vars
Updated by ybonatakis 8 months ago
Shouldn't the WORKER_CLASS include the tap device?
Anyhow, the WORKER_CLASS differs from that of the last successful job:
- "WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,64bit-ipmi-large-mem-intel,64bit-ipmi-ipxe,scooter,region-nue,datacenter-frankencampus,location-nue2,imagetester,cpu-x86_64,cpu-x86_64-v2", vs
- "WORKER_CLASS" : "virt-mm-64bit-ipmi,64bit-ipmi,64bit-ipmi-large-mem,fozzie,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3",
Updated by mkittler 8 months ago
I think these two conditions lead to the observed problem:
- Those jobs were executed across imagetester.qe.nue2.suse.org and worker35.oqa.prg2.suse.org. Those two workers are under different domains, which makes sense as they are in completely different locations.
- The get_job_autoinst_vars function is not using the worker's FQDN (even though the FQDN is known to openQA via the WORKER_HOSTNAME worker property).
The problem is that the short name worker35 cannot be resolved from imagetester.qe.nue2.suse.org. It could only be resolved from workers under the same domain, like worker34.oqa.prg2.suse.org. Normally we only run MM jobs across workers in the same location, so not using the FQDN is normally not a problem (and get_job_autoinst_vars is probably rarely used anyway).
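To check this kind of resolution problem directly on a worker host, a minimal sketch like the following can help. It is not part of openQA or os-autoinst; the helper name can_resolve is made up for illustration, and it only relies on getaddrinfo from Perl's core Socket module.

```perl
use strict;
use warnings;
use Socket qw(getaddrinfo SOCK_STREAM);

# Hypothetical helper (not openQA code): return true if $host can be
# resolved from the machine this script runs on.
sub can_resolve {
    my ($host) = @_;
    # getaddrinfo returns an error value first, followed by the results
    my ($err, @addrs) = getaddrinfo($host, undef, {socktype => SOCK_STREAM});
    return !$err && @addrs > 0;
}

# Check every host name given on the command line
for my $host (@ARGV) {
    printf "%-32s %s\n", $host, can_resolve($host) ? 'resolves' : 'does not resolve';
}
```

Running it on imagetester with the arguments worker35 and worker35.oqa.prg2.suse.org should show whether only the short name fails to resolve, as the log line in the description suggests.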
Both involved jobs have just virt-mm-64bit-ipmi as their worker class, so they are allowed to run on non-tap machines in any location. The openQA scheduler therefore didn't screw things up.
We can solve this by improving at least one of the mentioned conditions:
1. Specify a worker class that is only present on workers in a certain location to avoid running those jobs across different locations. You may also use the newly added dependency pinning feature by adding PARALLEL_ONE_HOST_ONLY=1 next to your PARALLEL_WITH=… settings. (This feature is still subject to change as mentioned in the linked documentation. However, you're welcome to give it a try as we need feedback on that feature anyway.)
2. Make get_job_autoinst_vars use the FQDN. This requires a code change in os-autoinst and openQA. I'll have a look at how complicated it would be.
Considering that you might run into connectivity problems or performance limitations when running those jobs across workers in different locations, I suggest you look into 1. Meanwhile, I will nevertheless see how much sense it would make for us to implement 2.
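The idea behind making get_job_autoinst_vars use the FQDN can be sketched as follows. This is only an illustration under assumptions, not the actual os-autoinst or openQA code: the helper build_vars_url and the hash keys host, port and token are hypothetical, and only WORKER_HOSTNAME and the URL shape (http://host:port/token/vars) are taken from this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real implementation): build the URL of a
# parallel job's vars endpoint, preferring the worker's FQDN when it is known.
sub build_vars_url {
    my ($worker) = @_;
    # Fall back to the short host name if no FQDN has been recorded
    my $host = $worker->{WORKER_HOSTNAME} // $worker->{host};
    return sprintf 'http://%s:%d/%s/vars', $host, $worker->{port}, $worker->{token};
}

my %worker = (
    host            => 'worker35',                      # short name, only resolvable within its own domain
    WORKER_HOSTNAME => 'worker35.oqa.prg2.suse.org',    # FQDN known to openQA
    port            => 20493,
    token           => 'JOBTOKEN',                      # placeholder for the job token
);
print build_vars_url(\%worker), "\n";
```

With the FQDN in the URL, the lookup no longer depends on the requesting worker being in the same domain as the worker it queries.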
Updated by mkittler 8 months ago
- Status changed from In Progress to Feedback
This PR will fix get_job_autoinst_vars for this case: https://github.com/os-autoinst/os-autoinst/pull/2478
However, I suggest you still look into avoiding cross-location MM jobs.
Updated by Julie_CAO 8 months ago
Thank you for the quick fix!
> However, I suggest you still look into avoiding cross-location MM jobs.
We now have 3 MM machines, one of which was just relocated to PRG2e. It's difficult for us to avoid using it in the short term, but in the long run we really need to consider this.
Updated by Julie_CAO 8 months ago
- Status changed from Feedback to Resolved
Verification passed. Thanks again. https://openqa.suse.de/tests/13919095