action #89731
Status: closed
coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
coordination #89842: [epic] Scalable and streamlined docker-compose based openQA setup
containers: The docker-compose based deployment is not stable and eventually fails
Description
Motivation
The command `docker-compose up` normally completes without errors, but sometimes some of the containers fail after docker-compose has finished.
$ docker-compose up -d
Creating webui_db_1 ... done
Creating webui_nginx_1 ... done
Creating webui_data_1 ... done
Creating webui_scheduler_1 ... done
Creating webui_webui_1 ... done
Creating webui_webui_2 ... done
Creating webui_gru_1 ... done
Creating webui_websockets_1 ... done
Creating webui_livehandler_1 ... done
$ echo $?
0
$ docker-compose ps
Name Command State Ports
----------------------------------------------------------------------------------------------------------------------------------------
webui_data_1 /bin/sh -c /usr/bin/tail - ... Up
webui_db_1 docker-entrypoint.sh postgres Up 5432/tcp
webui_gru_1 /root/run_openqa.sh Up 443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_livehandler_1 /root/run_openqa.sh Up 443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 0.0.0.0:9528->9528/tcp, 9529/tcp
webui_nginx_1 /entrypoint.sh Up 0.0.0.0:9526->9526/tcp
webui_scheduler_1 /root/run_openqa.sh Exit 255
webui_websockets_1 /root/run_openqa.sh Up 443/tcp, 80/tcp, 9526/tcp, 0.0.0.0:9527->9527/tcp, 9528/tcp, 9529/tcp
webui_webui_1 /root/run_openqa.sh Up 443/tcp, 80/tcp, 0.0.0.0:32789->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2 /root/run_openqa.sh Up 443/tcp, 80/tcp, 0.0.0.0:32790->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
The errors in the scheduler container are:
scheduler_1 | failed to run SQL in /usr/share/openqa/script/../dbicdh/PostgreSQL/deploy/90/001-auto-__VERSION.sql: DBIx::Class::DeploymentHandler::DeployMethod::SQL::Translator::try {...} (): DBI Exception: DBD::Pg::db do failed: ERROR: duplicate key value violates unique constraint "pg_type_typname_nsp_index"
scheduler_1 | DETAIL: Key (typname, typnamespace)=(dbix_class_deploymenthandler_versions_id_seq, 2200) already exists. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->deploy (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
scheduler_1 | (running line 'CREATE TABLE dbix_class_deploymenthandler_versions ( id serial NOT NULL, version character varying(50) NOT NULL, ddl text, upgrade_sql text, PRIMARY KEY (id), CONSTRAINT dbix_class_deploymenthandler_versions_version UNIQUE (version) )') at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/DeployMethod/SQL/Translator.pm line 263.
scheduler_1 | DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa-scheduler line 0
scheduler_1 | DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa-scheduler line 0
The problem is that every container using the openqa_webui image (webui_webui, webui_websockets, webui_scheduler, webui_livehandler) tries to initialize the DB tables. Since all the containers start at the same time, conflicts arise.
Acceptance Criteria
- AC 1: All the containers remain up after executing `docker-compose up`
- AC 2: Expand the docker-compose CI test to cover this case
Suggestions
- Use dependencies (depends_on) based on health checks to order the startup of the containers.
- Check the current solution in https://github.com/os-autoinst/openQA/pull/3755
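The depends_on-with-healthcheck suggestion could look roughly like this in the compose file. This is a minimal sketch, assuming a compose file version that supports the long depends_on syntax (v2.1 here); the healthcheck command and service details are illustrative, not taken from the actual openQA compose file:

```yaml
version: "2.1"
services:
  db:
    image: postgres
    healthcheck:
      # Assumed check: wait until PostgreSQL accepts connections
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 10
  webui:
    image: openqa_webui
    depends_on:
      db:
        condition: service_healthy  # only start once db reports healthy
```

With this, docker-compose delays starting `webui` until the `db` healthcheck passes, rather than starting everything simultaneously.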
Updated by livdywan over 3 years ago
- Blocked by action #89719: docker-compose up fails on master added
Updated by livdywan over 3 years ago
I think logically this will come after #89719, hence marking this as Blocked, since the ACs assume an existing compose test that can be extended.
Updated by livdywan over 3 years ago
- Blocks action #76978: How to run an openQA test in 5 minutes size:M added
Updated by livdywan over 3 years ago
Updated by livdywan over 3 years ago
- Blocks deleted (action #76978: How to run an openQA test in 5 minutes size:M)
Updated by livdywan over 3 years ago
- Blocked by action #89722: Need automatic check for docker-compose added
Updated by ilausuch over 3 years ago
I found that the list of running containers can be obtained with
docker-compose ps --services --filter status=running
Unfortunately, I also tried status=exited, but that only works with docker, not with docker-compose (at least in version 1.27.4).
Another option is docker-compose ps | grep Exit, but this only works as long as no container has a name containing "Exit".
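One way to avoid the grep fragility is to compare the full service list against the running-only list. A hedged sketch, where the two sample lists stand in for the real CLI calls (`docker-compose ps --services` and `docker-compose ps --services --filter status=running`):

```shell
#!/bin/bash
# Print compose services that are expected but not currently running.
not_running_services() {
    # comm -23 prints lines that appear only in the first (sorted) input,
    # i.e. services present in the full list but absent from the running list.
    comm -23 <(sort <<<"$1") <(sort <<<"$2")
}

# Example inputs; in a CI check these would come from docker-compose itself.
all=$'db\nscheduler\nwebui\nwebsockets'
running=$'db\nwebui\nwebsockets'
not_running_services "$all" "$running"   # prints: scheduler
```

The check fails the build if the function prints anything, without depending on container names or on the exact wording of the State column.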
Updated by livdywan over 3 years ago
ilausuch wrote:
I found that the list of running containers can be obtained with
docker-compose ps --services --filter status=running
Unfortunately, I also tried status=exited, but that only works with docker, not with docker-compose (at least in version 1.27.4).
Another option is docker-compose ps | grep Exit, but this only works as long as no container has a name containing "Exit".
I would approach it from the Mojo app first before looking at the containers. If e.g. database access is not locked, that would be a problem for other use cases, too.
Updated by livdywan over 3 years ago
Target version set to future
From the description it seemed like a pretty serious bug to me. Did you confirm that it's not?
Updated by okurz over 3 years ago
No, I did not. We made an exception to originally accept the docker-compose.yaml without automatic tests. So we just learned again what we already knew: every code contribution needs an automatic test. Fortunately, the docker-compose approach is not yet prominently mentioned in the documentation.
So #89722 first
Updated by livdywan over 3 years ago
okurz wrote:
No, I did not. We made an exception to originally accept the docker-compose.yaml without automatic tests. So we just learned again what we already knew: every code contribution needs an automatic test. Fortunately, the docker-compose approach is not yet prominently mentioned in the documentation.
So #89722 first
Sure. I just thought that's conveyed by the blocked status and not necessary to handle manually.
Updated by ilausuch over 3 years ago
I ran a test that fails:
https://github.com/os-autoinst/openQA/pull/3818/checks?check_run_id=2236273686
I only added a sleep before checking that all the containers are up, to give the failure time to appear.
Updated by ilausuch over 3 years ago
There is a draft PR with some approaches
https://github.com/os-autoinst/openQA/pull/3821
The problem I have detected is that all the webui containers try to create and initialize the DB at the same time: since docker-compose launches all the webui replicas simultaneously and we have more than one, conflicts occur.
My approach has been to create an initial webui and order the dependencies so that the webui cluster is only started after this first webui has been initialized.
In addition, I base the dependencies on health checks, ensuring that startup does not proceed to the next phase until the previous one is complete.
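The phased startup described above could be sketched like this; the actual service names and healthcheck commands are in PR #3821 and may differ, so treat this as an illustration of the pattern rather than the real file:

```yaml
services:
  db:
    image: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
  webui_db_init:
    # Phase 1: a single webui instance that creates/initializes the schema
    image: openqa_webui
    depends_on:
      db:
        condition: service_healthy
  webui:
    # Phase 2: the webui replicas start only after the schema exists
    image: openqa_webui
    depends_on:
      webui_db_init:
        condition: service_healthy
```

Because only one container ever runs the schema deployment, the "duplicate key value violates unique constraint" race from the description cannot occur.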
Updated by ilausuch over 3 years ago
Why does it take >30 min?
I think the process spends a lot of time building the images. In an empirical test on my local machine, the installation of packages was the slowest part, but this should be measured to be sure.
One idea:
We need to build these images because the PR launching the test may contain changes that should be tested. However, we could base these images on a previously built image where all the packages are already installed. This would accelerate the testing process; at least it is something I have already tried successfully elsewhere.
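The caching idea could be sketched as follows; the registry path and tags are hypothetical, not from the actual CI setup, and `--cache-from` is the standard docker build flag for reusing layers from an existing image:

```shell
#!/bin/bash
# Build a PR image on top of a prebuilt base image so that only the
# changed layers are rebuilt. The base image (with all packages already
# installed) would be rebuilt rarely, e.g. nightly.
build_with_cache() {
    local base=$1 tag=$2
    # Pull the prebuilt base so its layers are available locally, then
    # build on top of it; --cache-from lets docker reuse those layers.
    docker pull "$base"
    docker build --cache-from "$base" -t "$tag" .
}
# usage: build_with_cache registry.example.com/openqa-base:latest openqa_webui:pr
```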
Updated by livdywan over 3 years ago
Updated by ilausuch over 3 years ago
- Related to action #91046: CI: "webui-docker-compose" seems that eventually fails again added
Updated by okurz over 3 years ago
- Due date set to 2021-04-21
- Assignee set to ilausuch
@ilausuch I assume you should be assignee due to #89731#note-16 ?
Updated by ilausuch over 3 years ago
We have a new situation under investigation here:
https://github.com/os-autoinst/openQA/pull/3850/checks?check_run_id=2388942783
webui_1 is unhealthy:
Name Command State Ports
------------------------------------------------------------------------------------------------------------------------------------------------
webui_db_1 docker-entrypoint.sh postgres Up (healthy) 5432/tcp
webui_webui_1 /root/run_openqa.sh Up (unhealthy) 443/tcp, 80/tcp, 0.0.0.0:49155->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2 /root/run_openqa.sh Up (healthy) 443/tcp, 80/tcp, 0.0.0.0:49154->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_db_init_1 sh -c chmod -R a+rwX /data ... Up (healthy) 443/tcp, 80/tcp, 0.0.0.0:49153->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
Updated by ilausuch over 3 years ago
Solved with https://github.com/os-autoinst/openQA/pull/3850
Updated by livdywan over 3 years ago
ilausuch wrote:
Solved with https://github.com/os-autoinst/openQA/pull/3850
So is #91377 a duplicate of this? Or vice versa? Is the same problem being solved here?
Updated by ilausuch over 3 years ago
No, #91377 was for the static check and this is for the docker-compose test.
Updated by ilausuch over 3 years ago
AC1 is solved by https://github.com/os-autoinst/openQA/pull/3821
Updated by ilausuch over 3 years ago
Regarding AC2, all the containers should reach a healthy status.
But we can see here that nginx (the last container in the chain) needs some time to start up:
https://github.com/os-autoinst/openQA/pull/3864/checks?check_run_id=2444880438
Name Command State Ports
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
webui_db_1 docker-entrypoint.sh postgres Up (healthy) 5432/tcp
webui_gru_1 sh -c /root/run_openqa.sh| ... Up (healthy) 443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_livehandler_1 /root/run_openqa.sh Up (healthy) 443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 0.0.0.0:9528->9528/tcp,:::9528->9528/tcp, 9529/tcp
webui_nginx_1 /entrypoint.sh Up (health: starting) 0.0.0.0:9526->9526/tcp,:::9526->9526/tcp
webui_scheduler_1 /root/run_openqa.sh Up (healthy) 443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_websockets_1 /root/run_openqa.sh Up (healthy) 443/tcp, 80/tcp, 9526/tcp, 0.0.0.0:9527->9527/tcp,:::9527->9527/tcp, 9528/tcp, 9529/tcp
webui_webui_1 /root/run_openqa.sh Up (healthy) 443/tcp, 80/tcp, 0.0.0.0:49155->9526/tcp,:::49155->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2 /root/run_openqa.sh Up (healthy) 443/tcp, 80/tcp, 0.0.0.0:49154->9526/tcp,:::49154->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_db_init_1 sh -c chmod -R a+rwX /data ... Up (healthy) 443/tcp, 80/tcp, 0.0.0.0:49153->9526/tcp,:::49153->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
I am thinking of two possible solutions:
- Give nginx some time to start, with an active wait until it reaches "healthy" or "unhealthy", which are final statuses
- Create a dummy container depending on nginx to ensure that nginx has started healthy
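The first option (active wait until a final health status) could be sketched like this. The container name is an example; the `docker inspect` format string is the standard way to read a container's health state:

```shell
#!/bin/bash
# Poll a container's health status until it reaches a final state.
# Returns 0 for "healthy", 1 for "unhealthy" or timeout.
wait_for_health() {
    local container=$1 timeout=${2:-60} status
    for ((i = 0; i < timeout; i++)); do
        status=$(docker inspect --format '{{.State.Health.Status}}' "$container")
        case $status in
            healthy)   return 0 ;;
            unhealthy) return 1 ;;
        esac
        sleep 1   # still "starting"; try again
    done
    return 1   # never reached a final status within the timeout
}
# usage: wait_for_health webui_nginx_1 120
```

This avoids a fixed sleep: the check proceeds as soon as nginx becomes healthy, and fails fast if it becomes unhealthy.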
Updated by ilausuch over 3 years ago
AC2 has moved to #91815. Since AC1 is covered (https://progress.opensuse.org/issues/89731#note-27), this ticket can be considered resolved. Do you agree?
Updated by ilausuch over 3 years ago
- Description updated (diff)
- Status changed from In Progress to Resolved