action #10960
Updated by okurz about 8 years ago
## observation Starting with early Friday, 2016-02-26, a performance regression on o.s.d was notified. I could reproduce what looks like the issue at hand locally. Enabling SQL query debugging this is how the logfile looks like when accessing e.g. `/group_overview/25` ``` [Sat Feb 27 19:50:18 2016] [debug] Routing to controller "OpenQA::WebAPI::Controller::Test" and action "overview" [Sat Feb 27 19:50:18 2016] [debug] [DBIx debug] Took 0.00022697 seconds executed: SELECT me.id, me.name, me.size_limit_gb, me.keep_logs_in_days, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? ): '25'. [Sat Feb 27 19:50:21 2016] [debug] [DBIx debug] Took 2.80118012 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. [Sat Feb 27 19:50:22 2016] [debug] [DBIx debug] Took 0.78418899 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. [Sat Feb 27 19:50:27 2016] [debug] [DBIx debug] Took 4.06245685 seconds executed: SELECT me.job_id, me.result, me.soft_failure, COUNT( id ) FROM job_modules me WHERE ( job_id IN (…). [Sat Feb 27 19:50:27 2016] [debug] [DBIx debug] Took 0.76173782 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.00085783 seconds executed: SELECT me.id, me.job_id, me.group_id, me.text, me.user_id, me.hidden, me.t_created, me.t_updated FROM comments me WHERE ( job_id IN (…). [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.59689903 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219522', 'failed'. [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.00040007 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219522'. [Sat Feb 27 19:50:28 2016] [debug] reading /var/lib/openqa/testresults/00219522-sle-12-SP2-Server-DVD-s390x-Build1186-gnome/details-installation_overview.json [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.60596800 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219521', 'failed'. [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.00020695 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219521'. [Sat Feb 27 19:50:29 2016] [debug] reading /var/lib/openqa/testresults/00219521-sle-12-SP2-Server-DVD-s390x-Build1186-minimal+base/details-installation_overview_before.json [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.63927102 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219518', 'failed'. [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.00028086 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219518'. [Sat Feb 27 19:50:29 2016] [debug] reading /var/lib/openqa/testresults/00219518-sle-12-SP2-Server-DVD-s390x-Build1186-textmode/details-installer_timezone.json [Sat Feb 27 19:5 ``` So what I see quers for "job_modules" and "job_settings" taking long is many entries like ``` Took 0.63927102 seconds executed: SELECT me.id, … FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219518', 'failed'. ``` ## steps to reproduce * load database dump 2016-02-26 from o.s.d into local PostgreSQL database (e.g. `sudo -u geekotest createdb openqa_suse_de_2016-02-26 && sudo -u geekotest pg_restore -d openqa_suse_de_2016-02-26 --role=geekotest ~/local/os-autoinst/openQA/local/SQL-DUMPS/openqa.suse.de/2016-02-26.dump`) * configure local openQA to load this database, e.g. ``` $ mkdir -p local && cat - > local/stage_openqa_suse_de_2016-02-26/database.ini << EOF [stage] dsn = dbi:Pg:dbname=openqa_suse_de_2016-02-26 user = geekotest EOF ``` * load openQA with this database and SQL queries, e.g. ``` $ time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_2016-02-26 OPENQA_DATABASE=stage script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1201&groupid=25' ``` * observe super long loading time and slow queries ## problem The overall processing time is way to long, main waiting time from queries. My (okurz) local tests yield 15s for loading ``` real 0m59.996s user 0m1.472s sys 0m0.056s ``` to render the index page with database dump from 2016-02-26 whereas it was around 2s for 2016-02-25. request as shown in "steps to reproduce" H1: H1. *REJECTED* - A machine specific problem (DONE: REJECTED, see E1-1, could be reproduced locally) H2: H2. *REJECTED* - Recent openQA changes introduced a performance regression (DONE: REJECTED, see #10960#4 E2-1) H3: *ACCEPTED* - H3. The database got weird because of recent openQA changes H3.1: *REJECTED* - As more comments are used now because of okurz's changes the parsing gets slow (DONE: REJECTED, see #10960#6) H3.2: *ACCEPTED* - Other changes cause the slowdown (DONE: ACCEPTED, see #10960#6) H3.2.1: Something caused many more jobs considered for job settings queries to appear recently H3.2.2: postgreSQL decides on its own that it should consider more jobs in job settings queries trying to do right but failing H4: *REJECTED* - H4. The database got weird due to other effects (DONE: REJECTED, see E4-1) ## suggestion E1-1: E1-1. DONE: Try to reproduce locally -> could be reproduced E2-1: E2-1. DONE: Crosscheck with older version, e.g. the one used before last upgrade on o.s.d, if confirmed, git bisect to find culprit -> old version (97b8d9238aa918493883199e4da88eef3e578797) does not show an improvement in performance -> database f*ed E3-1: DONE: E3-1. Run an older database dump with recent openQA -> older database still fine (see #10960#4) E3.2.1-1: TODO ask others, don't know what could have caused this E3.2.2-1: TODO ask @coolo as he mentioned something like this recently E4-1: DONE: E4-1. @waitfor E2-1+E3-1, if both fail, accept H4 -> E2-1 FAIL, E3-1 SUCCESS