action #69355

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[spike] redundant/load-balancing webui deployments of openQA

Added by okurz over 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2020-07-25
Due date:
% Done: 0%
Estimated time:

Description

Motivation

A single web UI instance can cause longer downtimes and makes OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot. Load balancing would also help, as would having switch-over deployments possible for easier testing, staging, etc.

Acceptance criteria

  • AC1: a proof of concept exists, e.g. a test instance setup that can be reproduced
  • AC2: documentation exists on how to set up redundant/load-balancing infrastructures

Suggestions

As our worker setup already has good inherent redundancy and load balancing, start with the webui server.

The state of the art is k8s, so we should aim for that. Maybe a "docker compose" file is a good intermediate step, then k8s with a Helm chart, potentially also some setup based on GitLab, see
https://docs.gitlab.com/ee/ci/environments/incremental_rollouts.html#blue-green-deployment
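
For illustration, the "docker compose" intermediate step could look roughly like this. This is a minimal sketch only, assuming a placeholder web UI image name (openqa-webui:latest) and the default openQA web UI port 9526; the haproxy configuration itself is not shown:

# docker-compose.yml sketch: several web UI replicas behind one load balancer
version: "3"
services:
  db:
    image: postgres:12            # single shared database
    environment:
      POSTGRES_USER: openqa
      POSTGRES_PASSWORD: openqa   # placeholder credentials
      POSTGRES_DB: openqa
  webui:
    image: openqa-webui:latest    # placeholder image name, not a published tag
    depends_on:
      - db
    # scale out with: docker-compose up -d --scale webui=2
  haproxy:
    image: haproxy:2.4
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro  # LB config (not shown)
    ports:
      - "9526:9526"               # the only published entry point
    depends_on:
      - webui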


Related issues 6 (2 open, 4 closed)

Related to openQA Infrastructure - action #65154: root partition on osd exceeds alert threshold, 90%, after osd deployment -> apply automatic reboots to OSD machines (Resolved, okurz, 2020-04-01 – 2020-08-04)
Related to openQA Project - action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD (New, 2019-08-08)
Related to openQA Project - action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers (Resolved, ilausuch, 2018-11-13)
Related to openQA Project - action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui (Resolved, ilausuch, 2018-11-13)
Related to openQA Project - action #41600: fallback mechanism for apache, e.g. on osd (New, 2018-09-26)
Copied to openQA Project - action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA (Resolved, mkittler)
Actions #1

Updated by okurz over 4 years ago

  • Related to action #65154: root partition on osd exceeds alert threshold, 90%, after osd deployment -> apply automatic reboots to OSD machines added
Actions #2

Updated by okurz over 4 years ago

  • Related to action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD added
Actions #3

Updated by mkittler about 4 years ago

start with the webui server

The web UI server runs a lot of services and provides the filesystem /var/lib/openqa so this raises a lot of questions:

  1. I assume there would still be just one database.
  2. What about files within /var/lib/openqa? I suppose this file system still needs to be shared between the web UIs, e.g. using NFS like we already share /var/lib/openqa/share with the worker hosts.
  3. What about the scheduler? I suppose there should still be just one to avoid race conditions.
  4. What about the liveviewhandler and websocket server? Both are so far separate services because only a single process is supposed to serve the routes they provide: they rely on resources that would need to be shared between multiple instances if there were multiple instances. In the case of the liveviewhandler that resource is a websocket connection to another server, so sharing it would be quite hard. So there should be only one of each of these services (in their current form).
  5. What about the Minion service? Since we rely on locking to avoid running jobs in parallel that shouldn't run in parallel, I'd say there should only be one Minion service as well, or multiple Minion services would have to share the same database. These jobs often involve a lot of filesystem access under /var/lib/openqa, so having multiple Minion services (sharing the same database) is likely not very beneficial if those filesystem accesses would then be performed from a remote host.

Looks like we would really only run multiple instances of openqa-webui.service and keep the other services on the web UI host as they are. Currently we are actually already running multiple web UI processes via preforking, so maybe we could reduce the number of prefork processes then. Otherwise we would likely need to take care not to exceed the limit for PostgreSQL connections, e.g. by implementing #55262.
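
In compose terms that conclusion could be sketched like this; a hedged fragment where the image name and script paths are my assumptions, not the actual PoC:

# Scale only the stateless web UI; keep the singleton services at one instance
services:
  webui:                          # safe to run N times behind a load balancer
    image: openqa-webui:latest    # placeholder image name
  scheduler:                      # exactly one, to avoid race conditions (point 3)
    image: openqa-webui:latest
    command: /usr/share/openqa/script/openqa-scheduler   # assumed path
  websockets:                     # exactly one (point 4)
    image: openqa-webui:latest
    command: /usr/share/openqa/script/openqa-websockets  # assumed path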


can cause longer downtimes and makes OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot.

But some host needs to provide the "entry point" and database, right? So if that host is down we would still have a downtime.


Load balancing would also help, as would having switch-over deployments possible for easier testing, staging, etc.

As long as we share the same database I would be very careful to run multiple instances of different openQA versions in parallel.

Actions #4

Updated by okurz about 4 years ago

mkittler wrote:

start with the webui server

The web UI server runs a lot of services and provides the filesystem /var/lib/openqa so this raises a lot of questions:

  1. I assume there would still be just one database.

Either that or database replication.

  2. What about files within /var/lib/openqa? I suppose this file system still needs to be shared between the web UIs, e.g. using NFS like we already share /var/lib/openqa/share with the worker hosts.

Yes, I do not see an immediate problem with that, as we either already have scalability features in place, like minion jobs with locks, or have the same problems we want to fix anyway, like when saving needles interferes with updating the git repo via fetchneedles.

  3. What about the scheduler? I suppose there should still be just one to avoid race conditions.

yes

  4. What about the liveviewhandler and websocket server? Both are so far separate services because only a single process is supposed to serve the routes they provide: they rely on resources that would need to be shared between multiple instances if there were multiple instances. In the case of the liveviewhandler that resource is a websocket connection to another server, so sharing it would be quite hard. So there should be only one of each of these services (in their current form).

Yes. This is why it is good that we already have separate services, even though we do not make full use of the possibilities yet (like running multiple instances of the webui service but not of the liveviewhandler and websocket services).

  5. What about the Minion service? Since we rely on locking to avoid running jobs in parallel that shouldn't run in parallel, I'd say there should only be one Minion service as well, or multiple Minion services would have to share the same database. These jobs often involve a lot of filesystem access under /var/lib/openqa, so having multiple Minion services (sharing the same database) is likely not very beneficial if those filesystem accesses would then be performed from a remote host.

Looks like we would really only run multiple instances of openqa-webui.service and keep the other services on the web UI host as they are. Currently we are actually already running multiple web UI processes via preforking, so maybe we could reduce the number of prefork processes then. Otherwise we would likely need to take care not to exceed the limit for PostgreSQL connections, e.g. by implementing #55262.

Exactly. Same as now we would run multiple processes, but potentially spread out over different hosts, with the ability to dynamically alter their number and control them individually, e.g. for blue/green deployments.
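
A k8s sketch of the blue/green idea (all names, labels and image tags here are made up for illustration): two deployments run the two versions and traffic is switched by flipping the Service selector.

# Flip "slot: blue" to "slot: green" to cut traffic over without downtime
apiVersion: v1
kind: Service
metadata:
  name: openqa-webui
spec:
  selector:
    app: openqa-webui
    slot: blue
  ports:
    - port: 9526
      targetPort: 9526
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openqa-webui-blue        # a second "green" deployment looks the same
spec:
  replicas: 2
  selector:
    matchLabels: {app: openqa-webui, slot: blue}
  template:
    metadata:
      labels: {app: openqa-webui, slot: blue}
    spec:
      containers:
        - name: webui
          image: openqa-webui:blue   # placeholder image tag
          ports:
            - containerPort: 9526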

can cause longer downtimes and makes OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot.

But some host needs to provide the "entry point" and database, right? So if that host is down we would still have a downtime.

Yes, but this host can be the one used for the HTTP load balancer, which is much simpler than openQA itself, already has built-in scalability features and can be monitored, for example, by Kubernetes itself.
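
What "monitored by Kubernetes" could mean concretely, as a hedged pod-spec fragment (probe path, port and timings are assumptions): the entry-point container gets liveness/readiness probes so k8s restarts it or takes it out of rotation automatically.

# Fragment of a Deployment pod spec for the load-balancer container
containers:
  - name: haproxy
    image: haproxy:2.4
    ports:
      - containerPort: 9526
    readinessProbe:             # stop routing traffic to an unready pod
      httpGet:
        path: /
        port: 9526
      periodSeconds: 10
    livenessProbe:              # restart the container if it stops responding
      httpGet:
        path: /
        port: 9526
      initialDelaySeconds: 15
      periodSeconds: 20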

[…]
As long as we share the same database I would be very careful to run multiple instances of different openQA versions in parallel.

Yes, careful, but I would still do it :) If you like we could have a simple experimentation task first which would be "run two openQA webui instances of different versions against a single database". But I guess with just some simple practices for how we introduce schema changes we should be good. If not, then just shut off all old versions as soon as the schema is newer than the currently supported one.

Actions #5

Updated by mkittler about 4 years ago

If you like we could have a simple experimentation task first which would be "run two openQA webui instances of different versions against a single database".

We don't need an experimentation task. I can already tell you that it will sometimes work and sometimes not: if both versions use the same schema version it will work, otherwise not. The version with the newer schema will obviously migrate the database to that new version; the old version will then run into database errors. I know that all too well because I often switch between different versions locally that use different schema versions.

Actions #6

Updated by ilausuch about 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to ilausuch
Actions #8

Updated by ilausuch about 4 years ago

Right now I am covering all the cases that have been mentioned in the PR review, and also:

  • removing the dependence on apache
  • adding a loadbalancer for the webui
  • configuring webui, livehandler, websockets, gru and scheduler as independent containers
webui_data_1          /bin/sh -c /usr/bin/tail - ...   Up
webui_db-admin_1      entrypoint.sh docker-php-e ...   Up      0.0.0.0:8080->8080/tcp
webui_db_1            docker-entrypoint.sh postgres    Up      5432/tcp
webui_gru_1           /root/run_openqa.sh              Up      9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_haproxy_1       /docker-entrypoint.sh hapr ...   Up      0.0.0.0:9526->9526/tcp
webui_livehandler_1   /root/run_openqa.sh              Up      9526/tcp, 9527/tcp, 0.0.0.0:9528->9528/tcp, 9529/tcp
webui_scheduler_1     /root/run_openqa.sh              Up      9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_websockets_1    /root/run_openqa.sh              Up      9526/tcp, 0.0.0.0:9527->9527/tcp, 9528/tcp, 9529/tcp
webui_webui_1         /root/run_openqa.sh              Up      0.0.0.0:32813->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2         /root/run_openqa.sh              Up      0.0.0.0:32814->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
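
For context: a listing like the above is what docker-compose ps prints, and the second replica webui_webui_2 would presumably have been started with something like "docker-compose up -d --scale webui=2" (the scale invocation is my assumption). Note that only haproxy publishes the fixed web UI port 9526 while the webui replicas get random host ports (32813, 32814).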
Actions #9

Updated by livdywan about 4 years ago

  • Related to action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers added
Actions #10

Updated by livdywan about 4 years ago

  • Blocks action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui added
Actions #11

Updated by okurz about 4 years ago

  • Blocks deleted (action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui)
Actions #12

Updated by okurz about 4 years ago

  • Related to action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui added
Actions #13

Updated by okurz about 4 years ago

  • Due date set to 2020-10-08

ilausuch wrote:

Right now I am covering all the cases that have been mentioned in the PR review, and also:

  • removing the dependence on apache
  • adding a loadbalancer for the webui
  • configuring webui, livehandler, websockets, gru and scheduler as independent containers

Sounds great. Please keep the acceptance criteria in mind and consider everything that is not immediately necessary for them to be out of scope. You can note down all additional ideas in this ticket, a followup ticket, or #65271. Also please keep https://progress.opensuse.org/projects/qa/wiki/Wiki#SLOs-service-level-objectives in mind with "aim for cycle time of individual tickets (not epics or sagas): 1h-2w". Setting the due date accordingly.

Actions #14

Updated by ilausuch about 4 years ago

Some changes were done in the PR. If you like it I'll proceed to document this version.

Actions #15

Updated by ilausuch about 4 years ago

Added the documentation to the README.md

A question: should it be backwards compatible, i.e. able to run without docker-compose?
If yes, I will create a new commit that allows this backwards compatibility, so that without docker-compose the same behavior as before is expected.

Actions #16

Updated by ilausuch about 4 years ago

Before closing this task I would like to check the compatibility with the workers being worked on in ticket #43715.

Actions #17

Updated by okurz about 4 years ago

  • Related to action #41600: fallback mechanism for apache, e.g. on osd added
Actions #18

Updated by ilausuch about 4 years ago

This ticket is blocked until the mystery of https://progress.opensuse.org/issues/43715#note-22 is solved.

Actions #19

Updated by ilausuch about 4 years ago

  • Status changed from In Progress to Blocked
Actions #20

Updated by okurz about 4 years ago

  • Status changed from Blocked to Workable
  • Assignee deleted (ilausuch)

Ok, that makes sense. We should use "blocked" only when we have another ticket reference to wait for. Let's say "workable" for someone else to pick up and "solve" that mystery :)

Actions #21

Updated by okurz about 4 years ago

  • Priority changed from Normal to Low
Actions #22

Updated by ilausuch about 4 years ago

Problems with the worker are solved. Now it is using nginx as the load balancer. Preparing the auto-scaling and the PR.

Actions #23

Updated by ilausuch about 4 years ago

https://github.com/os-autoinst/openQA/pull/3488

This PR has the same purpose as the previous one but uses nginx, to be able to integrate the websockets and livehandler services into the path structure of the API.
This solves the communication from the workers with the API and the websockets using the same port.
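
In other words (the concrete route prefixes are not spelled out here and would be in the PR): nginx listens on a single published port and dispatches by path, proxying regular UI/API requests to the web UI pool and dedicated path prefixes to the websocket server and the livehandler, so workers and browsers only need one host:port.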

Actions #24

Updated by ilausuch about 4 years ago

  • Status changed from Workable to Resolved
  • Assignee set to ilausuch
Actions #25

Updated by ilausuch about 4 years ago

  • Status changed from Resolved to In Progress
Actions #26

Updated by ilausuch about 4 years ago

  • Status changed from In Progress to Resolved

Done

Actions #27

Updated by okurz about 4 years ago

  • Copied to action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA added
Actions #28

Updated by okurz about 4 years ago

Thank you. To provide a little bit more information: we have a proof of concept with the instructions so far in https://github.com/os-autoinst/openQA/blob/master/docker/README.md . As a next step we can make sure this is included in https://open.qa/docs . Not in this ticket though; that is now covered in #76990.

Actions #29

Updated by okurz almost 4 years ago

  • Subject changed from redundant/load-balancing webui deployments of openQA to [spike] redundant/load-balancing webui deployments of openQA
Actions #30

Updated by okurz almost 4 years ago

  • Parent task set to #80142
Actions #31

Updated by okurz over 3 years ago

  • Due date deleted (2020-10-08)