action #20812
closed
Jobs will be assigned to workers with wrong arch unless WORKER_CLASS is set somewhere
Description
I think that since commit fd3c570f8f4554037ffae1179742b9025390eabe there is no longer any simple arch-based protection against jobs running on a worker of the wrong arch. The %cando matrix in Common.pm is still there, but if you trace it out, the code which ultimately decides whether a job is appropriate (job_grab, in Scheduler/Scheduler.pm) never actually looks at it any more. The values from it get passed into job_grab as the 'workercaps' arg, and the only thing the function does with 'workercaps' is pass it back to the worker (when it does $worker->seen($workercaps)); it does nothing else with those values. So unless the test suite, machine or product specifies WORKER_CLASS, openQA will happily go ahead and try to run an x86_64 job on a ppc64 worker. To cite an entirely random example. Or, you know, possibly not entirely random:
https://openqa.fedoraproject.org/admin/workers/19
I'm gonna go ahead and add WORKER_CLASS to all our machine definitions in our distri to fix our instance, but I do think it's worth reporting that openQA does the wrong thing if WORKER_CLASS isn't explicitly set.
Updated by coolo almost 7 years ago
The protection might not be so useful on larger clusters where requiring WORKER_CLASS would be the easier solution. But test developers have single host installations - and we need to protect them from running random architectures :)
Updated by dasantiago over 6 years ago
- Related to coordination #32851: [tools][EPIC] Scheduling redesign added
Updated by EDiGiacinto over 6 years ago
A bit tricky in practice for s390x jobs, where workers are actually x86_64 and CPU_ARCH is set to that value
Just to clarify in terms of ACs:
- Make the scheduler aware of each worker's job capabilities and do not assign jobs to workers with a different architecture
- While doing this, take into account workers that execute jobs on different platforms, either by having the worker explicitly declare that or by inferring it with a different mechanism
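To make the two ACs concrete, here is a minimal sketch of what such a check could look like. It is written in Python for illustration only and is not openQA's actual scheduler code. The matrix contents, the function name `worker_can_run` and the `declared_archs` parameter are all assumptions; the real %cando matrix in Common.pm is not reproduced here.

```python
# Hypothetical compatibility matrix: worker host arch -> job archs it can run.
# Illustrative only; these entries are NOT copied from openQA's %cando matrix.
CAN_DO = {
    "x86_64": {"x86_64", "i586"},
    "ppc64": {"ppc64"},
    "aarch64": {"aarch64"},
}


def worker_can_run(job_arch, worker_arch, declared_archs=None):
    """Return True if a worker may take a job of the given architecture.

    declared_archs is a hypothetical set of extra architectures the worker
    explicitly declares, e.g. an x86_64 host driving s390x jobs or running
    ARM tests under emulation.
    """
    # Second AC: an explicit declaration by the worker wins.
    if declared_archs and job_arch in declared_archs:
        return True
    # First AC: otherwise fall back to the arch compatibility matrix.
    # Unknown worker archs are assumed to run only their own arch.
    return job_arch in CAN_DO.get(worker_arch, {worker_arch})
```

With this sketch, `worker_can_run("x86_64", "ppc64")` is rejected, while `worker_can_run("s390x", "x86_64", {"s390x"})` is accepted because the worker declares the foreign platform itself.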
Updated by AdamWill over 6 years ago
Indeed, we have a similar case with running an ARM test on x86_64 (using extreeeemeeeely slooooooooow emulation).
I mean, it's possible there's no really great fix here. If attempting to fix it gets too complex there's probably a point at which we should just stop, throw out the %cando matrix, and document "you should do this in instance config with worker classes". I don't think that's too terrible so long as it's written down.
Updated by dasantiago over 6 years ago
- Related to action #33580: Jobs are assigned to workers with different backend added
Updated by coolo over 6 years ago
- Difficulty set to easy
I wouldn't care for deployments with complicated workers - admins of those need to read documentation. But jobs post and isos post should take care that we have a WORKER_CLASS - and default to qemu_$ARCH to make the result predictable.
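A rough sketch of that defaulting rule, in Python for illustration only. The function name `with_default_worker_class` and the plain-dict settings shape are assumptions, not openQA's actual API code; only the `qemu_$ARCH` default comes from the comment above.

```python
def with_default_worker_class(settings):
    """Return a copy of the job settings with WORKER_CLASS filled in.

    If WORKER_CLASS is missing or empty and ARCH is known, default to
    "qemu_<ARCH>". An explicitly set WORKER_CLASS is left untouched.
    """
    result = dict(settings)  # do not mutate the caller's settings
    if not result.get("WORKER_CLASS"):
        arch = result.get("ARCH")
        if arch:
            result["WORKER_CLASS"] = f"qemu_{arch}"
    return result
```

Under this rule a job posted with only `ARCH=x86_64` would end up with `WORKER_CLASS=qemu_x86_64`. Deployments with more complicated workers keep whatever class they set explicitly.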
Updated by mkittler over 5 years ago
- Assignee set to mkittler
- Target version changed from Ready to Current Sprint
Updated by mkittler over 5 years ago
- Status changed from New to In Progress