coordination #99549

open

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Split production workload onto multiple hosts (focusing on OSD)

Added by mkittler about 3 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2021-09-30
Due date:
% Done: 0%
Estimated time:
Description

motivation

It is not possible to simply add more CPU cores to the OSD VM (see #97943). However, the host is definitely overloaded, at least from time to time: when an extraordinarily large number of jobs is scheduled, it becomes very slow and we see 503 responses.

It would be possible to request another VM, though. Therefore it would make sense to evaluate how the workload can be split onto multiple hosts.

options

We've already discussed the topic and found that there are multiple options. Some of them could be combined. The following list is in no particular order; I only used numbers for easier referencing:

  1. Have an additional, completely independent openQA setup which would only share workers.
    1. Advantage: It is as easy as setting up a new openQA instance. No further openQA features or special setup tweaks would be required.
    2. Disadvantage: The split is user-visible and requires a high degree of coordination with users. Possibly it is not wanted at all.
  2. Allow executing certain Minion tasks on a different host.
    1. It would likely not make sense for (cleanup) tasks which mainly cause filesystem load (they'd just use the main VM's filesystem via NFS after all).
    2. Not sure how well this use case is supported by Minion, in particular when only a subset of tasks can run on a different host.
  3. Run openQA web UI workers on the other host.
    1. This would require sharing the storage, e.g. via NFS.
    2. Or would it be possible to move only certain routes which do not rely on the storage?
  4. Run additional services like scheduler, web socket server and livehandler on a different host.
    1. We could likely get away without requiring access to the storage on the additional host.
    2. Likely only a slight improvement.
  5. Run the PostgreSQL service on a different host.
    1. It would be interesting to monitor how much CPU usage the PostgreSQL database causes.
    2. The Telegraf queries that go through PostgreSQL could be moved to the database host as well.

suggestions

  • Think about further options because I surely missed some.
  • Evaluate certain options more closely.
  • Strike out unfeasible options.
  • Monitor the current resource utilization more closely, e.g. to determine how much CPU load the different services like PostgreSQL actually cause.
  • Create sub tasks for further concrete actions.
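
For the monitoring suggestion above, a rough first look at per-service CPU load could be taken directly on the host before setting up anything more permanent. The following is only a minimal sketch, not part of openQA: the function name `cpu_by_service` and the substrings in `SERVICE_PATTERNS` are assumptions and would have to be adjusted to the process command lines actually seen on OSD. It sums the `%CPU` column of `ps -eo pcpu,args` output per service.

```python
import subprocess
from collections import defaultdict

# Assumption: substrings expected in the process command lines.
# These are illustrative only and need to be checked against the real host.
SERVICE_PATTERNS = {
    "postgresql": "postgres",
    "scheduler": "openqa-scheduler",
    "websockets": "openqa-websockets",
    "livehandler": "openqa-livehandler",
}


def cpu_by_service(ps_output, patterns=SERVICE_PATTERNS):
    """Sum the %CPU column of `ps -eo pcpu,args` output per service label."""
    totals = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            pcpu = float(parts[0])
        except ValueError:
            continue  # header row or unparsable line
        args = parts[1]
        # The first matching pattern wins; everything else is "other".
        label = next(
            (name for name, needle in patterns.items() if needle in args),
            "other",
        )
        totals[label] += pcpu
    return dict(totals)


def sample_live():
    """Run `ps` on the local host and aggregate its output."""
    out = subprocess.run(
        ["ps", "-eo", "pcpu,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return cpu_by_service(out)
```

Calling `sample_live()` on the host returns a dictionary mapping each label to its summed `%CPU`. Note that `ps` reports CPU time relative to each process's lifetime, so this only gives a coarse picture; for load peaks, sampling repeatedly or relying on the existing Telegraf setup is likely the better choice.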

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #97943: Increase number of CPU cores on OSD VM due to high usage size:S (Resolved, mkittler)
Related to openQA Infrastructure (public) - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend (Resolved, okurz, 2022-06-22)