action #69355
closed · coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[spike] redundant/load-balancing webui deployments of openQA
Description
Motivation
A single instance of the webui can cause longer downtimes and makes OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot. Also, load-balancing can help, as can making switch-over deployments possible for easier testing, staging, etc.
Acceptance criteria
- AC1: a proof of concept exists, e.g. a test instance setup that can be reproduced
- AC2: documentation exists on how to set up redundant, load-balancing infrastructures
Suggestions
As our worker setup already has good inherent redundancy and load-balancing, start with the webui server.
The state of the art is k8s, so we should aim for that. Maybe a "docker compose" file is a good intermediate step, then k8s with a helm chart, potentially also some setup based on GitLab, see
https://docs.gitlab.com/ee/ci/environments/incremental_rollouts.html#blue-green-deployment
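To make the intermediate step concrete, here is a minimal docker-compose sketch. This is a sketch only: `openqa_webui` is a placeholder for a locally built openQA webui image, and the haproxy configuration file is assumed to exist.

```yaml
# Sketch: one shared database, N webui replicas, one haproxy entry point.
version: "3"
services:
  db:
    image: postgres:12
    environment:
      POSTGRES_USER: openqa
      POSTGRES_PASSWORD: openqa
  webui:
    image: openqa_webui            # placeholder for a locally built image
    depends_on:
      - db
  haproxy:
    image: haproxy
    ports:
      - "9526:9526"                # the single public entry point
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - webui
```

Scaling would then be e.g. `docker-compose up --scale webui=2`, with haproxy.cfg pointing at the webui backends (or using DNS-based discovery).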
Updated by okurz over 4 years ago
- Related to action #65154: root partition on osd exceeds alert threshold, 90%, after osd deployment -> apply automatic reboots to OSD machines added
Updated by okurz over 4 years ago
- Related to action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD added
Updated by mkittler about 4 years ago
> start with the webui server

The web UI server runs a lot of services and provides the filesystem `/var/lib/openqa`, so this raises a lot of questions:

- I assume there would still be just one database.
- What about files within `/var/lib/openqa`? I suppose this file system still needs to be shared between the web UIs, e.g. using NFS like we already share `/var/lib/openqa/share` with the worker hosts.
- What about the scheduler? I suppose there should still be just one to avoid race conditions.
- What about the liveviewhandler and websocket server? Both are so far separate services because only a single process is supposed to serve the routes they provide: they rely on resources which would need to be shared between instances if there were multiple instances. In case of the liveviewhandler that resource is a websocket connection to another server, so sharing it would be quite hard. So there should be only one instance of each of these services (in their current form).
- What about the Minion service? Since we rely on locking to avoid running jobs in parallel which shouldn't run in parallel, I'd say there should only be one Minion service as well, or multiple Minion services would have to share the same database. These jobs often involve a lot of filesystem access under `/var/lib/openqa`, so having multiple Minion services (sharing the same database) is likely not very beneficial if these filesystem accesses were then performed from a remote host.

Looks like we would really only run multiple instances of the `openqa-webui.service` and keep the other services on the web UI host as they are. Currently we're actually running multiple web UI processes via preforking. Maybe we could reduce the number of prefork processes then. Otherwise we would likely need to take care not to exceed the limit for PostgreSQL connections, e.g. by implementing #55262.
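Regarding that connection limit, a minimal sketch of what pooling via #55262 could look like in such a compose setup. The `edoburu/pgbouncer` image and its environment variables are assumptions; any equivalent PgBouncer deployment would do.

```yaml
# Sketch: PgBouncer pools connections in front of PostgreSQL, so many
# webui/prefork processes can connect without exhausting the server limit.
services:
  db:
    image: postgres:12
    environment:
      POSTGRES_USER: openqa
      POSTGRES_PASSWORD: openqa
  pgbouncer:
    image: edoburu/pgbouncer       # assumed image, see lead-in
    environment:
      DATABASE_URL: postgres://openqa:openqa@db/openqa
      MAX_CLIENT_CONN: "500"       # clients (webui processes) allowed to connect
      DEFAULT_POOL_SIZE: "20"      # actual server connections kept to PostgreSQL
    depends_on:
      - db
  # the webui services would then use pgbouncer:5432 as their database host
```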
> can cause longer downtimes and make upgrades of OS more risky, e.g. when we do not have management access to VMs that might fail to reboot.

But some host needs to provide the "entry point" and database, right? So if that host is down we would still have a downtime.

> Also, load-balancing can help as well as having switch-over deployments possible for easier testing, staging, etc.

As long as we share the same database I would be very careful to run multiple instances of different openQA versions in parallel.
Updated by okurz about 4 years ago
mkittler wrote:
> > start with the webui server
>
> The web UI server runs a lot of services and provides the filesystem `/var/lib/openqa`, so this raises a lot of questions:
>
> - I assume there would still be just one database.

either that or database replication
> - What about files within `/var/lib/openqa`? I suppose this file system still needs to be shared between the web UIs, e.g. using NFS like we already share `/var/lib/openqa/share` with the worker hosts.

Yes, I do not see an immediate problem with that, as we already have either scalability features in place, like minion jobs with locks, or the same problems we want to fix anyway, like when saving needles interferes with updating the git repo with fetchneedles.
> - What about the scheduler? I suppose there should still be just one to avoid race conditions.

yes
> - What about the liveviewhandler and websocket server? Both are so far separate services because only a single process is supposed to serve the routes they provide: they rely on resources which would need to be shared between instances if there were multiple instances. In case of the liveviewhandler that resource is a websocket connection to another server, so sharing it would be quite hard. So there should be only one instance of each of these services (in their current form).

yes. This is why it is good that we already have separate services, even though we do not make full use of the possibilities yet (like running multiple instances of the webui service but not of liveviewhandler and websockets).
> - What about the Minion service? Since we rely on locking to avoid running jobs in parallel which shouldn't run in parallel, I'd say there should only be one Minion service as well, or multiple Minion services would have to share the same database. These jobs often involve a lot of filesystem access under `/var/lib/openqa`, so having multiple Minion services (sharing the same database) is likely not very beneficial if these filesystem accesses were then performed from a remote host.
>
> Looks like we would really only run multiple instances of the `openqa-webui.service` and keep the other services on the web UI host as they are. Currently we're actually running multiple web UI processes via preforking. Maybe we could reduce the number of prefork processes then. Otherwise we would likely need to take care not to exceed the limit for PostgreSQL connections, e.g. by implementing #55262.

exactly, same as now: run multiple processes, but potentially spread out over different hosts, and be able to dynamically alter the number and control them individually, e.g. for blue/green deployments.
> > can cause longer downtimes and make upgrades of OS more risky, e.g. when we do not have management access to VMs that might fail to reboot.
>
> But some host needs to provide the "entry point" and database, right? So if that host is down we would still have a downtime.

yes, but this host can be the one used for the HTTP load balancer, which is simpler than openQA itself, already has built-in scalability features and can be monitored, for example by kubernetes itself.
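A minimal Kubernetes sketch of that idea, with placeholder names and image: the webui runs with multiple replicas behind a Service, and kubernetes restarts failed pods itself.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openqa-webui
spec:
  replicas: 2                       # multiple webui instances
  selector:
    matchLabels:
      app: openqa-webui
  template:
    metadata:
      labels:
        app: openqa-webui
    spec:
      containers:
        - name: webui
          image: openqa_webui       # placeholder image
          ports:
            - containerPort: 9526
---
apiVersion: v1
kind: Service                       # the load-balancing "entry point"
metadata:
  name: openqa-webui
spec:
  selector:
    app: openqa-webui
  ports:
    - port: 9526
      targetPort: 9526
```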
> […]
>
> As long as we share the same database I would be very careful to run multiple instances of different openQA versions in parallel.

yes, careful, but I would still do it :) If you like we could have a simple experimentation task first which would be "run two openQA webui instances of different versions against a single database". But I guess with just some simple practices for how we introduce schema changes we should be good. If not, then just shut off all old versions as soon as the schema is newer than the currently supported one.
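To illustrate, a hypothetical compose fragment for such a blue/green switch-over. The tags and names are placeholders, and this is only safe while both versions expect the same database schema (see the next comment).

```yaml
# Sketch: two webui versions share one database; traffic is switched
# at the load balancer rather than by redeploying containers.
services:
  webui-blue:
    image: openqa_webui:blue       # currently serving version (placeholder tag)
  webui-green:
    image: openqa_webui:green      # candidate version (placeholder tag)
  haproxy:
    image: haproxy
    ports:
      - "9526:9526"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    # switching blue -> green is then a haproxy.cfg change plus reload
```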
Updated by mkittler about 4 years ago
> If you like we could have a simple experimentation task first which would be "run two openQA webui instances of different versions against a single database".

We don't need an experimentation task. I can already tell you that it will sometimes work and sometimes not: if both versions use the same schema version it will work, otherwise not. The version with the newer schema will obviously migrate the database to that new version, and the old version will then run into database errors. I know that all too well because I often switch between different local versions which use different schema versions.
Updated by ilausuch about 4 years ago
- Status changed from Workable to In Progress
- Assignee set to ilausuch
Updated by ilausuch about 4 years ago
Initial approach
https://github.com/os-autoinst/openQA/pull/3431
Updated by ilausuch about 4 years ago
Right now I am covering all the cases that have been mentioned in the PR review, and also:
- removing the dependency on Apache
- adding a load balancer for the webui
- configuring webui, livehandler, websockets, gru and scheduler as independent containers
```
webui_data_1          /bin/sh -c /usr/bin/tail - ...   Up
webui_db-admin_1      entrypoint.sh docker-php-e ...   Up   0.0.0.0:8080->8080/tcp
webui_db_1            docker-entrypoint.sh postgres    Up   5432/tcp
webui_gru_1           /root/run_openqa.sh              Up   9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_haproxy_1       /docker-entrypoint.sh hapr ...   Up   0.0.0.0:9526->9526/tcp
webui_livehandler_1   /root/run_openqa.sh              Up   9526/tcp, 9527/tcp, 0.0.0.0:9528->9528/tcp, 9529/tcp
webui_scheduler_1     /root/run_openqa.sh              Up   9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_websockets_1    /root/run_openqa.sh              Up   9526/tcp, 0.0.0.0:9527->9527/tcp, 9528/tcp, 9529/tcp
webui_webui_1         /root/run_openqa.sh              Up   0.0.0.0:32813->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2         /root/run_openqa.sh              Up   0.0.0.0:32814->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
```
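Note the two webui replicas (`webui_webui_1` and `webui_webui_2`) with dynamically mapped host ports behind `webui_haproxy_1` on port 9526. Presumably the second replica was started with something like `docker-compose up --scale webui=2`; the exact invocation is an assumption and depends on the compose file in the PR.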
Updated by livdywan about 4 years ago
- Related to action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers added
Updated by livdywan about 4 years ago
- Blocks action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui added
Updated by okurz about 4 years ago
- Blocks deleted (action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui)
Updated by okurz about 4 years ago
- Related to action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui added
Updated by okurz about 4 years ago
- Due date set to 2020-10-08
ilausuch wrote:
> Right now I am covering all the cases that have been mentioned in the PR review, and also:
>
> - removing the dependency on Apache
> - adding a load balancer for the webui
> - configuring webui, livehandler, websockets, gru and scheduler as independent containers

Sounds great. Please keep the acceptance criteria in mind and consider everything that is not immediately necessary for them to be out of scope. You can note down all additional ideas in this ticket, a followup or #65271. Also please keep https://progress.opensuse.org/projects/qa/wiki/Wiki#SLOs-service-level-objectives in mind, with "aim for cycle time of individual tickets (not epics or sagas): 1h-2w". Setting the due date accordingly.
Updated by ilausuch about 4 years ago
Some changes are done in the PR. If you like them, I'll proceed to document this version.
Updated by ilausuch about 4 years ago
Added the documentation to the README.md.
A question: should it stay backward-compatible, i.e. keep working without docker-compose?
If yes, I will create a new commit that allows this backward compatibility, so that if you don't use docker-compose the same behavior as before is expected.
Updated by ilausuch about 4 years ago
Before closing this task I would like to check the compatibility with the workers, which is being worked on in ticket #43715.
Updated by okurz about 4 years ago
- Related to action #41600: fallback mechanism for apache, e.g. on osd added
Updated by ilausuch about 4 years ago
This ticket is blocked until the mystery of https://progress.opensuse.org/issues/43715#note-22 is solved.
Updated by ilausuch about 4 years ago
- Status changed from In Progress to Blocked
Updated by okurz about 4 years ago
- Status changed from Blocked to Workable
- Assignee deleted (ilausuch)
Ok, that makes sense. We should use "blocked" only when we have another ticket reference to wait for. Let's say "workable" for someone else to pick up and "solve" that mystery :)
Updated by ilausuch about 4 years ago
Problems with the worker solved. Now it is using nginx as the LB. Preparing the auto-scaling and the PR.
Updated by ilausuch about 4 years ago
https://github.com/os-autoinst/openQA/pull/3488
This PR has the same purpose as the previous one, but uses nginx to integrate the websockets and livehandler services into the path structure of the API.
This allows the workers to communicate with the API and the websockets server through the same port.
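For illustration, roughly what that entry point could look like in the compose file. This is a sketch under assumptions: the exact paths and service names may differ from the PR; `/api/v1/ws` and `/liveviewhandler` mirror the routes an openQA reverse proxy typically forwards to ports 9527 and 9528.

```yaml
# Sketch: one public port, with the websockets and livehandler services
# reachable on the same port via path-based routing in nginx.
services:
  nginx:
    image: nginx
    ports:
      - "9526:9526"                     # single port for browsers and workers
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
  # nginx.conf would contain roughly:
  #   location /                 -> proxy_pass to the webui replicas
  #   location /api/v1/ws        -> proxy_pass to websockets:9527 (with Upgrade headers)
  #   location /liveviewhandler  -> proxy_pass to livehandler:9528
```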
Updated by ilausuch about 4 years ago
- Status changed from Workable to Resolved
- Assignee set to ilausuch
Updated by ilausuch about 4 years ago
- Status changed from Resolved to In Progress
Updated by okurz about 4 years ago
- Copied to action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA added
Updated by okurz about 4 years ago
Thank you. To provide a little bit more information: we have a proof of concept with the instructions so far in https://github.com/os-autoinst/openQA/blob/master/docker/README.md . As a next step we can make sure this is included in https://open.qa/docs . Not in this ticket though; that is now tracked in #76990.
Updated by okurz almost 4 years ago
- Subject changed from redundant/load-balancing webui deployments of openQA to [spike] redundant/load-balancing webui deployments of openQA