action #182681
coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
coordination #178243: [epic] More efficient handling of big job schedules, not executable jobs, never matching worker classes, etc.
Dynamic openQA worker(s) spinoff during high load
0%
Description
Motivation
Worker slots sometimes become unavailable with a message similar to the following:
Unavailable: The average load (28.34 27.85 21.15) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is lower again.
While slots are unavailable, other tests can keep waiting in the scheduler queue for a day or more, which further delays the overall deliverable.
We should come up with a solution in which a temporary, exact copy of the unavailable worker slot is created on a separate machine and registered with the openQA webUI automatically, and is later unregistered from the webUI and deleted.
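As a rough illustration of how the overload condition could be detected as a trigger for such a spinoff, the sketch below parses the "Unavailable" message quoted above. The regular expression assumes the exact wording shown in this ticket. The names `parse_unavailable_message` and `needs_spare_worker`, as well as the policy of requiring both the 1-minute and 5-minute load to exceed the threshold, are hypothetical and not part of openQA.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumption: the message wording matches the example quoted in this ticket.
_NUMBER = r"\d+(?:\.\d+)?"
LOAD_MSG = re.compile(
    rf"The average load \((?P<loads>{_NUMBER}(?: {_NUMBER})*)\) is exceeding "
    rf"the configured threshold of (?P<threshold>{_NUMBER})\."
)


class LoadReport(NamedTuple):
    loads: Tuple[float, ...]  # load averages in the order they appear in the message
    threshold: float          # configured load threshold reported by the worker


def parse_unavailable_message(message: str) -> Optional[LoadReport]:
    """Extract load averages and threshold, or return None if the text does not match."""
    match = LOAD_MSG.search(message)
    if match is None:
        return None
    loads = tuple(float(value) for value in match.group("loads").split())
    return LoadReport(loads, float(match.group("threshold")))


def needs_spare_worker(report: LoadReport) -> bool:
    """Hypothetical policy: the first two load values both exceed the threshold."""
    first_two = report.loads[:2]
    return len(first_two) == 2 and all(load > report.threshold for load in first_two)


message = (
    "Unavailable: The average load (28.34 27.85 21.15) is exceeding the "
    "configured threshold of 25. The worker will temporarily not accept "
    "new jobs until the load is lower again."
)
report = parse_unavailable_message(message)
if report is not None and needs_spare_worker(report):
    print("spare worker would be requested")
```

Requiring two load values above the threshold is only one possible way to avoid reacting to very short spikes; the actual trigger would need to be agreed on as part of this ticket.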
Suggestions
- A spare bare-metal machine with no openQA setup should be connected to openQA, for example via IPMI or via HMC (for PPC workers)
- The bare-metal machine should be restored to its original state after the openQA workers are torn down once the tests finish