action #20812

Jobs will be assigned to workers with wrong arch unless WORKER_CLASS is set somewhere

Added by AdamWill almost 3 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: easy
Duration:

Description

I think since commit fd3c570f8f4554037ffae1179742b9025390eabe, there doesn't seem to be any simple arch-based protection against jobs running on a worker of the wrong arch any more. The %cando matrix in Common.pm is still there, but if you trace it out, the code which ultimately decides whether a job is appropriate (job_grab, in Scheduler/Scheduler.pm) never actually cares about it any more. The values from it get passed into job_grab as the 'workercaps' arg, and the only thing the function does with 'workercaps' is pass it back to the worker (when it does $worker->seen($workercaps)); it does nothing else with those values.

So unless the test suite, machine or product specifies WORKER_CLASS, openQA will happily go ahead and try to run an x86_64 job on a ppc64 worker. To cite an entirely random example. Or, you know, possibly not entirely random:

https://openqa.fedoraproject.org/admin/workers/19

I'm gonna go ahead and add WORKER_CLASS to all our machine definitions in our distri to fix our instance, but I do think it's worth reporting that openQA does the wrong thing if WORKER_CLASS isn't explicitly set.
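
To make the failure mode concrete, here is a minimal Python sketch of class-only matching. It is not the openQA scheduler code (that is Perl, in job_grab); the function name `worker_can_take`, the dictionary layout, and the comma-separated "every requested class must be offered" rule are assumptions made for illustration only.

```python
def worker_can_take(job_settings, worker_props):
    """Hypothetical class-only matching: the architecture is never compared."""
    wanted = job_settings.get("WORKER_CLASS")
    if not wanted:
        # No class requested, so every worker qualifies, whatever its arch.
        return True
    offered = set(worker_props.get("WORKER_CLASS", "").split(","))
    # Assumed rule: every class the job asks for must be offered by the worker.
    return set(wanted.split(",")) <= offered


ppc_worker = {"CPU_ARCH": "ppc64", "WORKER_CLASS": "qemu_ppc64"}

# An x86_64 job with no WORKER_CLASS is accepted by the ppc64 worker.
print(worker_can_take({"ARCH": "x86_64"}, ppc_worker))

# Once the machine definition carries WORKER_CLASS, the mismatch is rejected.
print(worker_can_take({"ARCH": "x86_64", "WORKER_CLASS": "qemu_x86_64"}, ppc_worker))
```

In this sketch the only thing that keeps the x86_64 job off the ppc64 worker is the explicit WORKER_CLASS, which is why adding it to every machine definition works as a fix on the instance side.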


Related issues

Related to openQA Project - action #32851: [tools][EPIC] Scheduling redesign (Resolved, 2018-05-05)

Related to openQA Project - action #33580: Jobs are assigned to workers with different backend (New, 2018-03-21)

History

#1 Updated by coolo over 2 years ago

  • Target version set to Ready

#2 Updated by coolo over 2 years ago

The protection might not be so useful on larger clusters where requiring WORKER_CLASS would be the easier solution. But test developers have single host installations - and we need to protect them from running random architectures :)

#3 Updated by dasantiago over 2 years ago

  • Related to action #32851: [tools][EPIC] Scheduling redesign added

#4 Updated by EDiGiacinto over 2 years ago

A bit tricky in practice for s390x jobs, where workers are actually x86_64 and CPU_ARCH is set to that value

Just to clarify in terms of ACs:

  • Make the scheduler aware of each worker's job capabilities and do not assign jobs to workers with a different architecture
  • While doing this, take into account when workers are executing jobs in different platforms - either worker explicitly declaring that, or inferring it with a different mechanism
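
One possible shape for these two ACs, as a rough Python sketch rather than a proposal for the actual scheduler code. CPU_ARCH is the worker value mentioned above; the `EXTRA_ARCHES` property, through which a worker would explicitly declare other platforms it can execute (the s390x case), is purely hypothetical and does not exist in openQA as far as this ticket shows.

```python
def runnable_arches(worker_props):
    """Architectures a worker claims it can run jobs for."""
    arches = {worker_props["CPU_ARCH"]}
    # EXTRA_ARCHES is a hypothetical, comma-separated explicit declaration.
    extra = worker_props.get("EXTRA_ARCHES", "")
    arches.update(a for a in extra.split(",") if a)
    return arches


def arch_matches(job_settings, worker_props):
    """Accept the job only if its ARCH is one the worker can run."""
    job_arch = job_settings.get("ARCH")
    # Jobs without ARCH are not restricted in this sketch.
    return job_arch is None or job_arch in runnable_arches(worker_props)


plain_x86 = {"CPU_ARCH": "x86_64"}
s390_driver = {"CPU_ARCH": "x86_64", "EXTRA_ARCHES": "s390x"}

print(arch_matches({"ARCH": "s390x"}, plain_x86))    # rejected: not declared
print(arch_matches({"ARCH": "s390x"}, s390_driver))  # accepted: declared explicitly
```

Inferring the extra platforms "with a different mechanism", the second option in the AC, is not covered here.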

#5 Updated by AdamWill over 2 years ago

Indeed, we have a similar case with running an ARM test on x86_64 (using extreeeemeeeely slooooooooow emulation).

I mean, it's possible there's no really great fix here. If attempting to fix it gets too complex there's probably a point at which we should just stop, throw out the %cando matrix, and document "you should do this in instance config with worker classes". I don't think that's too terrible so long as it's written down.

#6 Updated by dasantiago over 2 years ago

  • Related to action #33580: Jobs are assigned to workers with different backend added

#7 Updated by coolo about 2 years ago

  • Difficulty set to easy

I wouldn't care for deployments with complicated workers - admins of those need to read documentation. But jobs post and isos post should take care that we have a WORKER_CLASS - and default to qemu_$ARCH to make the result predictable.
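
The defaulting rule suggested here could look roughly like the Python sketch below. The helper name `with_default_worker_class` and the plain-dictionary settings are assumptions for illustration; the merged change itself is not shown in this ticket, and what to do when ARCH is missing is also an assumption (the sketch leaves the settings untouched).

```python
def with_default_worker_class(settings):
    """Return a copy of the job settings with WORKER_CLASS filled in if absent."""
    result = dict(settings)
    if not result.get("WORKER_CLASS") and result.get("ARCH"):
        # Default suggested in the comment above: qemu_$ARCH.
        result["WORKER_CLASS"] = "qemu_" + result["ARCH"]
    return result


print(with_default_worker_class({"ARCH": "x86_64"}))
print(with_default_worker_class({"ARCH": "ppc64", "WORKER_CLASS": "custom"}))
```

An explicitly set WORKER_CLASS is left alone, so instances that already configure worker classes would behave as before.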

#8 Updated by mkittler over 1 year ago

  • Assignee set to mkittler
  • Target version changed from Ready to Current Sprint

#9 Updated by mkittler over 1 year ago

  • Status changed from New to In Progress

#10 Updated by mkittler over 1 year ago

  • Status changed from In Progress to Resolved

PR has been merged
