coordination #99549


coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Split production workload onto multiple hosts (focusing on OSD)

Added by mkittler over 2 years ago. Updated over 2 years ago.

Feature requests

It is not possible to simply extend the OSD VM with additional CPU cores (see #97943). However, the host is definitely overloaded: at least from time to time, e.g. when extraordinarily many jobs are scheduled, it is very slow and we see 503 responses.

It would be possible to request another VM, though. Therefore it would make sense to evaluate how the workload can be split onto multiple hosts.


We've already discussed the topic and found that there are multiple options, some of which could be combined. The following list is not ordered by preference; I only used numbers for easier referencing:

  1. Have an additional, completely independent openQA setup which would only share workers.
    1. Advantage: It is as easy as setting up a new openQA instance. No further openQA features or special setup tweaks would be required.
    2. Disadvantage: The split is user-visible and requires considerable coordination with users. Possibly it is not wanted at all.
  2. Allow executing certain Minion tasks on a different host.
    1. It would likely not make sense for (cleanup) tasks which mainly cause filesystem load (they would just use the main VM's filesystem via NFS after all).
    2. It is not clear how well this use case is supported by Minion, in particular when only a subset of tasks can run on a different host.
  3. Run openQA web UI workers on the other host.
    1. This would require sharing the storage, e.g. via NFS.
    2. Or would it be possible to move only certain routes which do not rely on the storage?
  4. Run additional services like scheduler, web socket server and livehandler on a different host.
    1. We could likely get away without requiring access to the storage on the additional host.
    2. Likely only a slight improvement.
  5. Run the PostgreSQL service on a different host.
    1. It would be interesting to monitor how much CPU usage the PostgreSQL database causes.
    2. Telegraf queries via PostgreSQL could be moved to the database host as well.
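For option 3, the storage sharing could for instance be an NFS export of openQA's data directory from the main host. A minimal sketch of such an export; the host name is a placeholder and the option set is only an assumption, not a tested configuration:

```
# Hypothetical /etc/exports entry on the main OSD host:
# share openQA's data directory (results, assets) read-write with an
# additional web-UI host.
/var/lib/openqa  webui2.example.com(rw,no_subtree_check,sync)
```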


  • Think about further options because I surely missed some.
  • Evaluate certain options more closely.
  • Strike out unfeasible options.
  • Monitor the current resource utilization more closely, e.g. to determine how much CPU load the different services like PostgreSQL actually cause.
  • Create subtasks for further concrete actions.
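For the monitoring point above, per-service CPU usage can be estimated on Linux by summing `utime`+`stime` from `/proc/<pid>/stat` for all processes whose name matches a service. This is a hypothetical helper for illustration, not part of openQA or our Telegraf setup; the process-name prefixes are assumptions:

```python
#!/usr/bin/env python3
"""Rough per-service CPU accounting by summing utime+stime from /proc.

Linux-only sketch; pass process-name prefixes such as ("postgres", "openqa")
to see how much accumulated CPU time each group of processes has used.
"""
import os
from collections import defaultdict

# Clock ticks per second, needed to convert stat fields to seconds.
CLK_TCK = os.sysconf("SC_CLK_TCK")


def cpu_seconds_by_name(prefixes):
    """Sum CPU seconds (user + system) of all processes whose command
    name starts with one of the given prefixes."""
    totals = defaultdict(float)
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process exited in the meantime
        # The command name is in parentheses and may contain spaces,
        # so locate it explicitly instead of naively splitting.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2:].split()
        # fields[11] is utime (field 14), fields[12] is stime (field 15).
        utime, stime = int(fields[11]), int(fields[12])
        for prefix in prefixes:
            if comm.startswith(prefix):
                totals[prefix] += (utime + stime) / CLK_TCK
    return dict(totals)


if __name__ == "__main__":
    print(cpu_seconds_by_name(("postgres", "openqa", "apache")))
```

Sampling this periodically and diffing the totals would give a CPU-usage rate per service, similar to what `pidstat` or `systemd-cgtop` report.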

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #97943: Increase number of CPU cores on OSD VM due to high usage size:S (Resolved, mkittler)

Related to openQA Infrastructure - coordination #112718: [alert][osd] is not reachable anymore, response times > 30s, multiple alerts over the weekend (Resolved, okurz, 2022-06-22)

Actions #1

Updated by okurz over 2 years ago

  • Related to action #97943: Increase number of CPU cores on OSD VM due to high usage size:S added
Actions #2

Updated by okurz over 2 years ago

  • Target version set to future

Actually, the saga for "scale out to multiple containers" and the like is #80142.

Actions #3

Updated by tinita over 2 years ago

I would vote for 5, because it shouldn't be too hard and would be the first step in dividing the load.
It would even make monitoring the database itself easier, since we could just look at the load of the db host.
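Pointing openQA at a PostgreSQL instance on another host would mainly be a change to its database configuration. A minimal sketch following the format of openQA's `/etc/openqa/database.ini`; the host name, user, and password are placeholders:

```ini
; Hypothetical /etc/openqa/database.ini pointing at a dedicated DB host
[production]
dsn = dbi:Pg:dbname=openqa;host=db.example.com
user = openqa
password = changeme
```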

Actions #4

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #5

Updated by osukup over 2 years ago

5 is the easiest option :D but only a partial solution.

1 is an easy way to scale the problems to a new level. And if I remember correctly, in the past it was considered to have two openQA instances, one for production QA and a second for maintenance, which was rejected in the end (but that was before the implementation of shared workers).

There is also the possibility to check the performance of Apache and move it to its own server, or use something lighter like nginx?
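If the nginx route were explored, the reverse proxy could be as small as the following sketch. The server name is a placeholder; port 9526 is openQA's default web UI port, but the rest is an untested assumption, not a drop-in replacement for our Apache setup:

```nginx
# Hypothetical minimal nginx reverse proxy for the openQA web UI
server {
    listen 80;
    server_name openqa.example.com;

    location / {
        # openQA's web UI listens on 9526 by default
        proxy_pass http://127.0.0.1:9526;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```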

Actions #6

Updated by okurz almost 2 years ago

  • Related to coordination #112718: [alert][osd] is not reachable anymore, response times > 30s, multiple alerts over the weekend added
