action #89731

closed

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #89842: [epic] Scalable and streamlined docker-compose based openQA setup

containers: The deploy using docker-compose is not stable and eventually fails

Added by ilausuch almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-03-09
Due date:
% Done:

0%

Estimated time:

Description

Motivation

The command 'docker-compose up' executes without errors under normal circumstances, but sometimes some of the containers fail later, after docker-compose has finished.

$ docker-compose up -d
Creating webui_db_1    ... done
Creating webui_nginx_1       ... done
Creating webui_data_1  ... done
Creating webui_scheduler_1   ... done
Creating webui_webui_1       ... done
Creating webui_webui_2       ... done
Creating webui_gru_1         ... done
Creating webui_websockets_1  ... done
Creating webui_livehandler_1 ... done
$ echo $?
0
$ docker-compose ps
       Name                      Command                State                                     Ports                                 
----------------------------------------------------------------------------------------------------------------------------------------
webui_data_1          /bin/sh -c /usr/bin/tail - ...   Up                                                                               
webui_db_1            docker-entrypoint.sh postgres    Up         5432/tcp                                                              
webui_gru_1           /root/run_openqa.sh              Up         443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp               
webui_livehandler_1   /root/run_openqa.sh              Up         443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 0.0.0.0:9528->9528/tcp, 9529/tcp 
webui_nginx_1         /entrypoint.sh                   Up         0.0.0.0:9526->9526/tcp                                                
webui_scheduler_1     /root/run_openqa.sh              Exit 255                                                                         
webui_websockets_1    /root/run_openqa.sh              Up         443/tcp, 80/tcp, 9526/tcp, 0.0.0.0:9527->9527/tcp, 9528/tcp, 9529/tcp 
webui_webui_1         /root/run_openqa.sh              Up         443/tcp, 80/tcp, 0.0.0.0:32789->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2         /root/run_openqa.sh              Up         443/tcp, 80/tcp, 0.0.0.0:32790->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp

The errors in the scheduler are:

scheduler_1    | failed to run SQL in /usr/share/openqa/script/../dbicdh/PostgreSQL/deploy/90/001-auto-__VERSION.sql: DBIx::Class::DeploymentHandler::DeployMethod::SQL::Translator::try {...} (): DBI Exception: DBD::Pg::db do failed: ERROR:  duplicate key value violates unique constraint "pg_type_typname_nsp_index"
scheduler_1    | DETAIL:  Key (typname, typnamespace)=(dbix_class_deploymenthandler_versions_id_seq, 2200) already exists. at inline delegation in DBIx::Class::DeploymentHandler for deploy_method->deploy (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/WithApplicatorDumple.pm at line 51) line 18
scheduler_1    |  (running line 'CREATE TABLE dbix_class_deploymenthandler_versions ( id serial NOT NULL, version character varying(50) NOT NULL, ddl text, upgrade_sql text, PRIMARY KEY (id), CONSTRAINT dbix_class_deploymenthandler_versions_version UNIQUE (version) )') at /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/DeployMethod/SQL/Translator.pm line 263.
scheduler_1    | DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa-scheduler line 0
scheduler_1    | DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at /usr/share/openqa/script/openqa-scheduler line 0

The problem is that every container that uses the openqa_webui image (webui_webui, webui_websockets, webui_scheduler, webui_livehandler) tries to initialize the DB tables. Since all these containers are started at the same time, conflicts arise.

Acceptance Criteria

  • AC 1: All the containers remain up after executing docker-compose up
  • AC 2: Expand the docker-compose CI test to cover this case

Suggestions


Related issues 3 (0 open, 3 closed)

Related to openQA Project (public) - action #91046: CI: "webui-docker-compose" seems that eventually fails again (Resolved, ilausuch, 2021-04-13)

Blocked by openQA Project (public) - action #89719: docker-compose up fails on master (Resolved, ilausuch, 2021-03-09)

Blocked by openQA Project (public) - action #89722: Need automatic check for docker-compose (Resolved, ilausuch, 2021-03-09)
Actions #1

Updated by ilausuch almost 4 years ago

  • Description updated (diff)
Actions #2

Updated by livdywan almost 4 years ago

  • Blocked by action #89719: docker-compose up fails on master added
Actions #3

Updated by livdywan almost 4 years ago

I think logically this will come after #89719 hence marking this as Blocked, since the AC assume an existing compose test that can be extended

Actions #4

Updated by livdywan almost 4 years ago

  • Blocks action #76978: How to run an openQA test in 5 minutes size:M added
Actions #5

Updated by livdywan almost 4 years ago

cdywan wrote:

I think logically this will come after #89719 hence marking this as Blocked, since the AC assume an existing compose test that can be extended

Correction. Since @okurz filed a separate issue this can actually be blocked by #89722

Actions #6

Updated by livdywan almost 4 years ago

  • Blocks deleted (action #76978: How to run an openQA test in 5 minutes size:M)
Actions #7

Updated by livdywan almost 4 years ago

  • Blocked by action #89722: Need automatic check for docker-compose added
Actions #8

Updated by ilausuch almost 4 years ago

I checked that the list of running containers can be obtained with
docker-compose ps --services --filter status=running

Unfortunately I also tried status=exited, but it only works with docker, not with docker-compose (at least in version 1.27.4)

Another option is to use docker-compose ps | grep Exit, but this relies on none of the containers having a name containing "Exit"
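The two listings above can be combined to find the services that are no longer running. A minimal sketch: the two helper functions are stand-ins simulating the output of the real docker-compose calls (noted in the comments), since the actual commands need a running compose project:

```shell
#!/bin/bash
# List services that exist but are not running, by diffing the full service
# list against the running-only list.

all_services() {      # stand-in for: docker-compose ps --services
  printf '%s\n' db gru livehandler nginx scheduler websockets webui
}

running_services() {  # stand-in for: docker-compose ps --services --filter status=running
  printf '%s\n' db gru livehandler nginx websockets webui
}

# comm -23 prints the lines only present in the first input: the exited services.
failed=$(comm -23 <(all_services | sort) <(running_services | sort))
echo "not running: $failed"
```

With the sample data this prints `not running: scheduler`; in a CI check, a non-empty `$failed` would fail the job.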

Actions #9

Updated by livdywan almost 4 years ago

ilausuch wrote:

I checked that the list of running containers can be obtained with
docker-compose ps --services --filter status=running

Unfortunately I also tried status=exited, but it only works with docker, not with docker-compose (at least in version 1.27.4)

Another option is to use docker-compose ps | grep Exit, but this relies on none of the containers having a name containing "Exit"

I would approach it from the mojo app first before looking at the containers. If e.g. database access is not locked, that would be a problem for other use cases, too.

Actions #10

Updated by okurz almost 4 years ago

  • Target version set to future
Actions #11

Updated by livdywan almost 4 years ago

Target version set to future

From the description it seemed like a pretty serious bug to me. Did you confirm that it's not?

Actions #12

Updated by okurz almost 4 years ago

No, I did not. We originally made an exception to accept the docker-compose.yaml without automatic tests. So we just learned again what we already knew: every code contribution needs an automatic test. Gladly the docker-compose approach is not prominently mentioned in the documentation yet.

So #89722 first

Actions #13

Updated by livdywan almost 4 years ago

okurz wrote:

No, I did not. We originally made an exception to accept the docker-compose.yaml without automatic tests. So we just learned again what we already knew: every code contribution needs an automatic test. Gladly the docker-compose approach is not prominently mentioned in the documentation yet.

So #89722 first

Sure. I just thought that's conveyed by the blocked status and not necessary to handle manually.

Actions #14

Updated by okurz almost 4 years ago

  • Parent task set to #89842
Actions #15

Updated by ilausuch almost 4 years ago

I ran a test that fails:
https://github.com/os-autoinst/openQA/pull/3818/checks?check_run_id=2236273686
I only added a sleep before checking that all the containers are up, to give the failure time to appear.

Actions #16

Updated by ilausuch over 3 years ago

  • Status changed from New to In Progress
Actions #17

Updated by ilausuch over 3 years ago

There is a draft PR with some approaches:
https://github.com/os-autoinst/openQA/pull/3821

The problem I have detected is that all the webui containers try to create and initialize the DB at the same time; since docker-compose launches all the webui replicas simultaneously and we have more than one, this causes the conflict.
My approach has been to create an initial webui service and to order the dependencies so that the webui cluster is only started after this first webui has finished initializing.
In addition, I base the dependencies on health checks, ensuring that the next phase does not start until the previous one is complete.
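The dependency ordering described above can be sketched in compose syntax roughly as follows. This is an assumption-laden illustration, not the actual openQA docker-compose.yaml: the service names, image names, and healthcheck commands are made up, and depends_on with condition: service_healthy requires compose file format 2.1+:

```yaml
version: "2.1"
services:
  db:
    image: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10

  # A single service that creates and initializes the DB schema exactly once.
  webui_db_init:
    image: openqa_webui
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "test -e /tmp/db_initialized"]  # hypothetical marker file
      interval: 5s
      retries: 30

  # The replicated webui services only start after the schema exists,
  # so they no longer race each other on table creation.
  webui:
    image: openqa_webui
    depends_on:
      webui_db_init:
        condition: service_healthy
```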

Actions #18

Updated by livdywan over 3 years ago

  • Target version changed from future to Ready

I gather this is Ready then, after confirming with @okurz and @ilausuch that there was an internal conversation about this ticket not reflected here.

Actions #19

Updated by ilausuch over 3 years ago

Why does it take >30 min?

I think the process spends a lot of time building the images, and in empirical tests on my local machine the installation of packages is the most tedious part. But this should be measured to be sure.

One idea:
We need to build these images because the PR that triggers the test may contain changes that should be tested. However, we could base these images on a previously built image where all the packages are already installed. This would accelerate the testing process; at least it is something I have already tried myself.

Actions #20

Updated by livdywan over 3 years ago

ilausuch wrote:

Why does it take >30 min?

Was that comment meant for #90767 ?

Actions #21

Updated by ilausuch over 3 years ago

  • Related to action #91046: CI: "webui-docker-compose" seems that eventually fails again added
Actions #22

Updated by okurz over 3 years ago

  • Due date set to 2021-04-21
  • Assignee set to ilausuch

@ilausuch I assume you should be assignee due to #89731#note-16 ?

Actions #23

Updated by ilausuch over 3 years ago

We have a new situation under investigation here:
https://github.com/os-autoinst/openQA/pull/3850/checks?check_run_id=2388942783

webui_1 is unhealthy:

        Name                       Command                   State                                        Ports                                 
------------------------------------------------------------------------------------------------------------------------------------------------
webui_db_1              docker-entrypoint.sh postgres    Up (healthy)     5432/tcp                                                              
webui_webui_1           /root/run_openqa.sh              Up (unhealthy)   443/tcp, 80/tcp, 0.0.0.0:49155->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2           /root/run_openqa.sh              Up (healthy)     443/tcp, 80/tcp, 0.0.0.0:49154->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_db_init_1   sh -c chmod -R a+rwX /data ...   Up (healthy)     443/tcp, 80/tcp, 0.0.0.0:49153->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
Actions #25

Updated by livdywan over 3 years ago

ilausuch wrote:

Solved with https://github.com/os-autoinst/openQA/pull/3850

So is #91377 a duplicate of this? Or vice versa? Is the same problem being solved here?

Actions #26

Updated by ilausuch over 3 years ago

No, #91377 was for the static check and this is for the docker-compose test

Actions #28

Updated by ilausuch over 3 years ago

Regarding AC2, all the containers should reach a healthy status.

But we can see here that nginx (which is the last container in the chain) needs some time to start up.

https://github.com/os-autoinst/openQA/pull/3864/checks?check_run_id=2444880438

        Name                       Command                       State                                                     Ports                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
webui_db_1              docker-entrypoint.sh postgres    Up (healthy)            5432/tcp                                                                                 
webui_gru_1             sh -c /root/run_openqa.sh| ...   Up (healthy)            443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp                                  
webui_livehandler_1     /root/run_openqa.sh              Up (healthy)            443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 0.0.0.0:9528->9528/tcp,:::9528->9528/tcp, 9529/tcp  
webui_nginx_1           /entrypoint.sh                   Up (health: starting)   0.0.0.0:9526->9526/tcp,:::9526->9526/tcp                                                 
webui_scheduler_1       /root/run_openqa.sh              Up (healthy)            443/tcp, 80/tcp, 9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp                                  
webui_websockets_1      /root/run_openqa.sh              Up (healthy)            443/tcp, 80/tcp, 9526/tcp, 0.0.0.0:9527->9527/tcp,:::9527->9527/tcp, 9528/tcp, 9529/tcp  
webui_webui_1           /root/run_openqa.sh              Up (healthy)            443/tcp, 80/tcp, 0.0.0.0:49155->9526/tcp,:::49155->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_2           /root/run_openqa.sh              Up (healthy)            443/tcp, 80/tcp, 0.0.0.0:49154->9526/tcp,:::49154->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp
webui_webui_db_init_1   sh -c chmod -R a+rwX /data ...   Up (healthy)            443/tcp, 80/tcp, 0.0.0.0:49153->9526/tcp,:::49153->9526/tcp, 9527/tcp, 9528/tcp, 9529/tcp

I am thinking of two possible solutions:

  • Give nginx some time to start, with an active wait until it reaches healthy or unhealthy, which are "final" statuses
  • Create a dummy container depending on nginx, to ensure that nginx has started healthy
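The first option (active wait until a final status) could look roughly like the sketch below. is_final and the loop are assumptions, not code from the PR; in a real script the status string would come from parsing docker-compose ps output, or from docker inspect -f '{{.State.Health.Status}}' on the container:

```shell
#!/bin/bash
# Classify a container health status as final or transient: "healthy" and
# "unhealthy" are final; "health: starting" is transient and worth waiting on.
is_final() {
  case "$1" in
    *unhealthy*|*healthy*) return 0 ;;  # final statuses
    *)                     return 1 ;;  # e.g. "Up (health: starting)"
  esac
}

# Active wait: poll until the status is final or the retries run out.
# $1 is a command printing the current status (placeholder for the real query).
wait_until_final() {
  local get_status=$1 tries=0
  until is_final "$($get_status)"; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && return 1
    sleep 2
  done
}
```

Note that the `*healthy*` pattern also matches "unhealthy", which is intended: both are final states, and the caller then decides whether a final "unhealthy" fails the test.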
Actions #29

Updated by ilausuch over 3 years ago

AC2 has moved to #91815. Since AC1 is covered (https://progress.opensuse.org/issues/89731#note-27), this ticket can be considered resolved. Do you agree?

Actions #30

Updated by ilausuch over 3 years ago

  • Description updated (diff)
  • Status changed from In Progress to Resolved
Actions #31

Updated by okurz over 3 years ago

  • Due date deleted (2021-04-21)