[tools][openqa][research] Research on Federated openQA
To support testing with openQA from different physical locations, we need to research the possible ways to implement a federated openQA.
The idea would be to support scenarios 4, 2 and 1 (in order of priority) defined in poo#20514.
#2 Updated by szarate almost 3 years ago
For now, the initial idea discussed with coolo was to have a worker on a second instance picking up jobs and adding them to its own master.
I believe that for this we would need a fair number of bits/features in openQA, but mainly:
- The possibility for the scheduler on the master webUI to know which jobs need to be run on a specific cluster; the master webUI should handle these jobs as asynchronous or something similar
- Each cluster will have its own scheduler/worker (I'd prefer to give this responsibility to the scheduler) constantly checking the master webUI for new builds+jobs
- When new builds are detected, they are synced based on capabilities (e.g., the location supports ISOs, QCOW2, repos)
- Jobs that have been added to a second cluster will only be triggered once the build has been synced
- Once a job is finished, the master webUI gets a notification, along with the job results (could be bulk or per build?)
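The capability-based syncing and the "only trigger once the build is synced" rule from the list above could be sketched roughly as follows. This is a minimal in-memory illustration, not openQA code: the `Build` and `Cluster` classes, their fields, and the job dictionaries are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Build:
    """Hypothetical build record; asset_types lists what must be synced."""
    name: str
    asset_types: set  # e.g. {"iso", "qcow2", "repo"}


@dataclass
class Cluster:
    """Hypothetical remote cluster with a set of supported asset types."""
    name: str
    capabilities: set
    synced_builds: set = field(default_factory=set)

    def can_sync(self, build):
        # A build is only syncable if every asset type it needs is supported.
        return build.asset_types <= self.capabilities

    def sync(self, build):
        # Record the build as synced; the real asset transfer is out of scope.
        if not self.can_sync(build):
            return False
        self.synced_builds.add(build.name)
        return True

    def runnable_jobs(self, jobs):
        # Jobs are only triggered once their build has been synced.
        return [job for job in jobs if job["build"] in self.synced_builds]
```

With this shape, a job for a build that the location cannot sync simply never becomes runnable on that cluster, which leaves the decision about what to do with it to the master webUI.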
#4 Updated by coolo almost 3 years ago
The possibility for the scheduler on the master webUI to know which jobs need to be run on a specific cluster; the master webUI should handle these jobs as asynchronous or something similar
This can be done through worker classes, I believe.
As a first approach, really keep both openQA instances ignorant of the concept and try to have all business logic in a bridging worker: grabbing jobs for X worker classes, syncing, scheduling them on another instance, waiting for the result, and sending it back.
And then let's see what support we need. But the idea is not to make openQA job scheduling even more complex.
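A rough sketch of that bridging-worker cycle, with both instances kept ignorant of the federation, might look like the following. Everything here is an assumption made for illustration: `FakeInstance` is an in-memory stand-in rather than the openQA API, `bridge_cycle` and `run_job` are hypothetical names, and asset syncing is reduced to a comment.

```python
class FakeInstance:
    """In-memory stand-in for an openQA instance (NOT the real API)."""

    def __init__(self):
        self.jobs = {}
        self._next_id = 1

    def post_job(self, settings):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {
            "id": job_id,
            "settings": dict(settings),
            "state": "scheduled",
            "result": None,
        }
        return job_id

    def jobs_for_classes(self, worker_classes):
        # Scheduled jobs whose WORKER_CLASS is one the bridge serves.
        return [
            job for job in self.jobs.values()
            if job["state"] == "scheduled"
            and job["settings"].get("WORKER_CLASS") in worker_classes
        ]


def bridge_cycle(master, remote, worker_classes, run_job):
    """Grab jobs, schedule them remotely, wait for results, send them back."""
    reported = []
    for job in master.jobs_for_classes(worker_classes):
        job["state"] = "running"              # grabbed by the bridge
        # (asset syncing would happen here; omitted in this sketch)
        remote_id = remote.post_job(job["settings"])
        remote_job = remote.jobs[remote_id]
        remote_job["result"] = run_job(remote_job)  # stands in for waiting
        remote_job["state"] = "done"
        job["result"] = remote_job["result"]  # send the result back
        job["state"] = "done"
        reported.append(job["id"])
    return reported
```

To the master, the bridge looks like any other worker that took a job and later reported a result, so neither scheduler needs to know about the second instance.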
#6 Updated by szarate almost 3 years ago
- Status changed from New to Resolved
The research was done and sparked a few questions, but first things first:
- While working on this, in conversations with coolo, the idea of creating a worker_bridge came into play and resulted in the following PR: https://github.com/os-autoinst/openQA/pull/1414
- This approach takes for granted that a slaveUI (an openQA instance present in a separate location) has a setup similar to the openQA instance that we have in production.
- The worker bridge will run on a machine and will have access to a masterUI and to a slaveUI (1 worker bridge = 1 slave), and will query the masterUI for jobs that have WORKER_CLASS=:my_location:our_worker_class
- Once the worker_bridge finds jobs on the masterUI that belong to its instance, it will clone them and add federated_report to the job settings before posting them to the slaveUI
- The masterUI should have the possibility to filter the job list by job setting, to avoid generating extra load when searching for jobs
- The worker bridge will monitor jobs on the slaveUI that have the job setting proxied, or use federated_report, and report progress to its masterUI (stored in federated_report) when the job is done, regardless of the state.
There is some progress on a few of these areas:
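The job selection and cloning steps described above could be sketched like this. It assumes the `:my_location:our_worker_class` form of WORKER_CLASS quoted in the list. The helper names are hypothetical, and two details are pure assumptions not settled in this ticket: that the location prefix is stripped before posting to the slaveUI, and that `federated_report` holds the master job id.

```python
import copy


def split_worker_class(value):
    """Split ':location:worker_class' into (location, worker_class).

    Returns None when the value does not have that form.
    """
    if not value.startswith(":"):
        return None
    parts = value[1:].split(":", 1)
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def clone_for_slave(master_job, my_location):
    """Return settings for the slaveUI clone, or None if the job is not ours."""
    parsed = split_worker_class(master_job["settings"].get("WORKER_CLASS", ""))
    if parsed is None or parsed[0] != my_location:
        return None
    settings = copy.deepcopy(master_job["settings"])
    # Assumption: the slaveUI only knows the plain worker class.
    settings["WORKER_CLASS"] = parsed[1]
    # Assumption: federated_report stores the master job id for reporting back.
    settings["federated_report"] = str(master_job["id"])
    return settings
```

Because the original job settings are copied rather than modified, the job on the masterUI stays untouched until the bridge reports a result.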
- Add support for colons on worker class: https://github.com/os-autoinst/openQA/pull/1408
- Add support for getting test results as json on the job json api: https://github.com/os-autoinst/openQA/pull/1424
Important things to know:
- Reporting of job status to the masterUI is not finished
- The worker bridge needs a refactor, as it was created as a proof of concept
- The /test API route needs to display all test results properly.
- There should be some kind of transactional behavior in case the cloning of jobs fails...
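One possible shape for the transactional cloning mentioned in the last point is an all-or-nothing helper that removes already created clones when a later one fails. This is only a sketch: `clone_all_or_nothing`, `post_job` and `delete_job` are hypothetical callables supplied by the caller, not functions of openQA or the worker_bridge.

```python
def clone_all_or_nothing(job_settings_list, post_job, delete_job):
    """Post every job, or roll back the ones already posted on failure.

    post_job(settings) must return the new job id; delete_job(job_id)
    must remove a previously created job. Both are caller-supplied.
    """
    created = []
    try:
        for settings in job_settings_list:
            created.append(post_job(settings))
    except Exception:
        # Undo in reverse order so dependent jobs go first, then re-raise.
        for job_id in reversed(created):
            delete_job(job_id)
        raise
    return created
```

The rollback itself can still fail (for example if the slaveUI becomes unreachable), so a real implementation would probably also need to record partially cloned builds somewhere for later cleanup.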
The main question right now is: when is the worker_bridge triggered, or when should it start working to pull jobs from the masterUI? In a conversation with coolo in the last review, he suggested that the worker_bridge should download the assets.