action #20526

[tools][openqa][research] Research on Federated openQA

Added by szarate over 2 years ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: szarate
Category: Feature requests
Target version: Milestone 9
Difficulty:
Duration:
Start date: 20/11/2017
Due date:
% Done: 100%

Description

In order to support testing with openQA from different physical locations, we need to research the possible ways to implement a federated openQA.

The idea would be to support scenarios 4, 2 and 1 (in order of priority) defined in poo#20514.


Subtasks

action #25278: [epic] Refactor worker_bridge for federated openQA support (Rejected)

action #27955: Allow the worker_bridge to sync job status from a slaveUI... (Rejected)

action #27958: [tools][Sprint 201711.2] Allow filtering of jobs by worke... (Resolved, szarate)

History

#2 Updated by szarate over 2 years ago

For now, the initial idea discussed with @coolo was to have a worker on a second instance picking up jobs and adding them to its own master.

I believe that for this we would need a fair number of bits/features in openQA, but mainly:

  • The possibility for the scheduler on the master webUI to know which jobs need to be run on a specific cluster; the master webUI should handle these jobs as asynchronous or something similar
  • Each cluster will have its own scheduler/worker (I'd prefer to give this responsibility to the scheduler) constantly checking the master webUI for new builds+jobs
  • When new builds are detected, they are synced based on capabilities (e.g. whether the location supports ISOs, QCOW2, repos)
  • Jobs that have been added to a second cluster will only be triggered once the build has been synced
  • Once a job is finished, the master webUI gets a notification, along with the job results (could be bulk or per build?)
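A minimal sketch of the last three points, assuming a purely in-memory model. The `Build` and `Cluster` classes, the asset type names and the `BUILD` job key are hypothetical illustrations, not openQA code:

```python
from dataclasses import dataclass, field


@dataclass
class Build:
    name: str
    assets: dict  # asset type -> file name, e.g. {"ISO": "build42.iso"}


@dataclass
class Cluster:
    capabilities: set  # asset types this location is able to store
    synced: set = field(default_factory=set)  # names of builds already synced

    def assets_to_sync(self, build):
        # Only assets whose type the location supports are transferred.
        return {k: v for k, v in build.assets.items() if k in self.capabilities}

    def sync(self, build):
        wanted = self.assets_to_sync(build)
        # A real implementation would transfer the files here (assumption).
        self.synced.add(build.name)
        return wanted

    def runnable(self, jobs):
        # Jobs are only triggered once their build has been synced.
        return [job for job in jobs if job["BUILD"] in self.synced]
```

With this model, a job for build "42" stays out of `runnable()` until `sync()` has been called for that build.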

#3 Updated by szarate over 2 years ago

I also came across the architecture of Apache Mesos, which sparked a couple of ideas, mainly the idea of looking into ZooKeeper.

#4 Updated by coolo over 2 years ago

The possibility for the scheduler on the master webUI to know which jobs need to be run on a specific cluster; the master webUI should handle these jobs as asynchronous or something similar

This can be done through worker classes I believe.

As a first approach, really keep both openQA instances ignorant of the concept and try to have all the business logic in a bridging worker: grabbing jobs for X worker classes, syncing, scheduling them to another instance, waiting for the result, and sending it back.

And then let's see what support we need. But the idea is not to make openqa job scheduling even more complex
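A rough sketch of such a bridging worker, assuming an in-memory stand-in for both instances. The `Instance` class and the `bridge_grab`/`bridge_report` helpers are hypothetical and do not reflect the real openQA API; the point is only that neither instance knows about federation:

```python
class Instance:
    """In-memory stand-in for an openQA instance (hypothetical, not the real API)."""

    def __init__(self):
        self.jobs = {}
        self._next_id = 1

    def post_job(self, settings):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {"settings": dict(settings), "state": "scheduled", "result": None}
        return job_id

    def scheduled_jobs(self, worker_classes):
        return [
            job_id
            for job_id, job in self.jobs.items()
            if job["state"] == "scheduled"
            and job["settings"].get("WORKER_CLASS") in worker_classes
        ]

    def finish(self, job_id, result):
        self.jobs[job_id].update(state="done", result=result)


def bridge_grab(local, remote, worker_classes):
    """Grab matching jobs locally and schedule copies on the other instance."""
    mapping = {}  # local job id -> remote job id
    for job_id in local.scheduled_jobs(worker_classes):
        local.jobs[job_id]["state"] = "running"
        mapping[job_id] = remote.post_job(local.jobs[job_id]["settings"])
    return mapping


def bridge_report(local, remote, mapping):
    """Send results of finished remote jobs back; return what is still pending."""
    pending = {}
    for local_id, remote_id in mapping.items():
        remote_job = remote.jobs[remote_id]
        if remote_job["state"] == "done":
            local.finish(local_id, remote_job["result"])
        else:
            pending[local_id] = remote_id
    return pending
```

A bridge process would call `bridge_grab` periodically and keep calling `bridge_report` on the returned mapping until it is empty.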

#5 Updated by coolo over 2 years ago

Please! Stop googling random key words - KISS!

#6 Updated by szarate over 2 years ago

  • Status changed from New to Resolved

The research was done and sparked a few questions, but first things first:

  • While working on this, in conversations with @coolo, the idea of creating a worker_bridge came into play and resulted in the following PR: https://github.com/os-autoinst/openQA/pull/1414
    • This approach takes for granted that a slaveUI (an openQA instance present in a separate location) has a setup similar to the openQA instance that we have in production.
    • The worker bridge will run on a machine with access to a masterUI and to a slaveUI (1 worker bridge = 1 slave), and will query the masterUI for jobs that have WORKER_CLASS=:my_location:our_worker_class
    • Once the worker_bridge finds jobs on the masterUI that belong to its instance, it will clone them and add proxied and federated_report to the job settings before posting them to the slaveUI
    • The masterUI should make it possible to filter the job list by job setting, to avoid generating extra load when searching for jobs
    • The worker bridge will monitor jobs on the slaveUI that have the job setting proxied or that use federated_report, and report progress to its masterUI (stored in federated_report) once the job is done, regardless of its final state.
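The settings handling described above could look roughly like the sketch below. The values are assumptions: the text only names the settings, so storing "1" in proxied and a masterUI job URL in federated_report is an illustrative guess, and the helper names and the example host are hypothetical.

```python
def belongs_to_location(settings, location):
    # Matches WORKER_CLASS values of the form ":my_location:our_worker_class".
    return settings.get("WORKER_CLASS", "").startswith(f":{location}:")


def clone_for_slave(master_job_id, settings, master_url):
    # Copy the settings so the masterUI job is left untouched.
    cloned = dict(settings)
    cloned["proxied"] = "1"  # assumed flag value
    # Assumed format: where the bridge should report the result back to.
    cloned["federated_report"] = f"{master_url}/tests/{master_job_id}"
    return cloned


def jobs_to_report(slave_jobs):
    # Jobs on the slaveUI that the bridge is responsible for monitoring.
    return [
        job
        for job in slave_jobs
        if "proxied" in job["settings"] or "federated_report" in job["settings"]
    ]
```

For example, `clone_for_slave(7, settings, "https://master.example")` returns a copy of `settings` with the two extra keys, which would then be posted to the slaveUI.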

There is some progress in a few of these areas.

Important things to know:

  • Reporting of job status to the masterUI is not finished
  • The worker bridge needs a refactor, as it was created as a proof of concept
  • The /test API route needs to display all test results properly.
  • There should be some kind of transactional behaviour, in case the cloning of jobs fails...
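One possible shape for the transactional cloning mentioned in the last point is an all-or-nothing helper that rolls back already created jobs when a later post fails. This is only a sketch: `post_job` and `delete_job` are hypothetical callables supplied by the caller, not openQA functions.

```python
def clone_all_or_nothing(jobs, post_job, delete_job):
    """Post every job, or none: undo the already created ones on failure."""
    created = []
    try:
        for settings in jobs:
            created.append(post_job(settings))
    except Exception:
        # Roll back in reverse order, then let the caller see the error.
        for job_id in reversed(created):
            delete_job(job_id)
        raise
    return created
```

The caller still sees the original exception, so the bridge can retry the whole batch later without leaving half a build on the slaveUI.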

The main question right now is: when is the worker_bridge triggered, or when should it start pulling jobs from the masterUI? In a conversation with @coolo during the last review, he suggested that the worker_bridge download the assets.

#7 Updated by szarate over 2 years ago

  • Due date set to 20/11/2017
  • Start date changed from 14/09/2017 to 20/11/2017

due to changes in a related task
