action #20812
closed
Jobs will be assigned to workers with wrong arch unless WORKER_CLASS is set somewhere
Added by AdamWill over 7 years ago.
Updated almost 6 years ago.
Category:
Feature requests
Description
I think that since commit fd3c570f8f4554037ffae1179742b9025390eabe there is no longer any simple arch-based protection against jobs running on a worker of the wrong arch. The %cando matrix in Common.pm is still there, but if you trace it out, the code which ultimately decides whether a job is appropriate - job_grab, in Scheduler/Scheduler.pm - never actually looks at it any more. The values from it get passed into job_grab as the 'workercaps' arg, and the only thing the function does with 'workercaps' is pass it back to the worker (when it does $worker->seen($workercaps)); it does nothing else with those values. So unless the test suite, machine or product specifies WORKER_CLASS, openQA will happily go ahead and try to run an x86_64 job on a ppc64 worker. To cite an entirely random example. Or, you know, possibly not entirely random:
https://openqa.fedoraproject.org/admin/workers/19
I'm gonna go ahead and add WORKER_CLASS to all our machine definitions in our distri to fix our instance, but I do think it's worth reporting that openQA does the wrong thing if WORKER_CLASS isn't explicitly set.
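To make the failure mode concrete, here is a minimal Python sketch of class-only matching. It is not openQA's actual Perl code; the function name `worker_can_run` and the dict layout are assumptions made for illustration. It only shows why a job without WORKER_CLASS ends up eligible for any worker when the architecture is never compared.

```python
def worker_can_run(job_settings, worker_props):
    """Hypothetical class-only matching: the architecture is never compared."""
    wanted = job_settings.get("WORKER_CLASS", "")
    wanted_classes = {c.strip() for c in wanted.split(",") if c.strip()}
    if not wanted_classes:
        # No WORKER_CLASS requested, so every worker looks acceptable.
        return True
    offered = worker_props.get("WORKER_CLASS", "")
    offered_classes = {c.strip() for c in offered.split(",") if c.strip()}
    # The worker must offer every class the job asks for.
    return wanted_classes <= offered_classes


ppc64_worker = {"CPU_ARCH": "ppc64", "WORKER_CLASS": "qemu_ppc64"}
unclassed_job = {"ARCH": "x86_64"}
classed_job = {"ARCH": "x86_64", "WORKER_CLASS": "qemu_x86_64"}

# The x86_64 job without WORKER_CLASS is accepted by the ppc64 worker.
mismatch_accepted = worker_can_run(unclassed_job, ppc64_worker)
# Setting WORKER_CLASS on the job (e.g. via the machine definition) prevents that.
mismatch_rejected = not worker_can_run(classed_job, ppc64_worker)
```

With WORKER_CLASS set in the machine definitions, as described above, the second case applies and the mismatch is filtered out.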
- Target version set to Ready
The protection might not be so useful on larger clusters where requiring WORKER_CLASS would be the easier solution. But test developers have single host installations - and we need to protect them from running random architectures :)
This is a bit tricky in practice for s390x jobs, where the workers are actually x86_64 hosts and CPU_ARCH is set to x86_64.
Just to clarify in terms of ACs:
- Make the scheduler aware of each worker's capabilities and do not assign jobs to workers with a different architecture
- While doing this, take into account workers that execute jobs on a different platform than their own - either the worker declares that explicitly, or it is inferred by some other mechanism
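The two ACs above could be sketched roughly as follows, in Python rather than openQA's Perl. The `CANDO` table here is an illustrative stand-in and not the real contents of %cando in Common.pm, and `RUNNABLE_ARCHES` is a hypothetical worker setting invented for this example to model a worker "explicitly declaring" what it can run.

```python
# Illustrative compatibility table (assumed values, not the real %cando matrix).
CANDO = {
    "x86_64": {"x86_64", "i586", "i686"},
    "ppc64": {"ppc64", "ppc"},
    "aarch64": {"aarch64"},
}


def runnable_arches(worker):
    """Return the set of job architectures this worker may be given."""
    declared = worker.get("RUNNABLE_ARCHES", "")  # hypothetical explicit override
    explicit = {a.strip() for a in declared.split(",") if a.strip()}
    if explicit:
        return explicit
    host_arch = worker["CPU_ARCH"]
    # Fall back to the table; unknown hosts can only run their own arch.
    return CANDO.get(host_arch, {host_arch})


def can_assign(job, worker):
    """First AC: refuse jobs whose ARCH the worker cannot run."""
    return job["ARCH"] in runnable_arches(worker)


ppc64_worker = {"CPU_ARCH": "ppc64"}
x86_worker = {"CPU_ARCH": "x86_64"}
# Second AC: an x86_64 host that drives s390x jobs declares so explicitly.
s390x_driver = {"CPU_ARCH": "x86_64", "RUNNABLE_ARCHES": "s390x"}
```

In this sketch an explicit declaration replaces the table lookup entirely, so the s390x driver would no longer accept x86_64 jobs. Whether a declaration should replace or extend the inferred set is a design choice the ticket leaves open.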
Indeed, we have a similar case with running an ARM test on x86_64 (using extreeeemeeeely slooooooooow emulation).
I mean, it's possible there's no really great fix here. If attempting to fix it gets too complex there's probably a point at which we should just stop, throw out the %cando matrix, and document "you should do this in instance config with worker classes". I don't think that's too terrible so long as it's written down.
- Related to action #33580: Jobs are assigned to workers with different backend added
I wouldn't worry about deployments with complicated worker setups - admins of those need to read the documentation. But jobs post and isos post should make sure that we have a WORKER_CLASS - and default to qemu_$ARCH to make the result predictable.
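The defaulting suggested here might look like this. It is a Python sketch with a hypothetical helper name, `with_default_worker_class`, and is not the code that was eventually merged into openQA. It assumes job settings arrive as a plain dict with an ARCH key.

```python
def with_default_worker_class(settings):
    """Return a copy of the job settings with WORKER_CLASS filled in if missing."""
    result = dict(settings)  # do not mutate the caller's settings
    if not result.get("WORKER_CLASS"):
        arch = result.get("ARCH")
        if arch:
            # Default proposed in this ticket: qemu_$ARCH
            result["WORKER_CLASS"] = f"qemu_{arch}"
    return result


posted = {"ARCH": "x86_64", "TEST": "example"}
normalized = with_default_worker_class(posted)
```

An explicitly set WORKER_CLASS is left untouched, so instances that already configure classes per machine would see no change in behavior.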
- Assignee set to mkittler
- Target version changed from Ready to Current Sprint
- Status changed from New to In Progress
- Status changed from In Progress to Resolved