action #10960
Updated by okurz about 8 years ago
## observation Starting with early Friday, 2016-02-26, a performance regression on o.s.d was notified. I could reproduce what looks like the issue at hand locally. Enabling SQL query debugging this is how the logfile looks like when accessing e.g. `/group_overview/25` ``` [Sat Feb 27 19:50:18 2016] [debug] Routing to controller "OpenQA::WebAPI::Controller::Test" and action "overview" [Sat Feb 27 19:50:18 2016] [debug] [DBIx debug] Took 0.00022697 seconds executed: SELECT me.id, me.name, me.size_limit_gb, me.keep_logs_in_days, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? ): '25'. [Sat Feb 27 19:50:21 2016] [debug] [DBIx debug] Took 2.80118012 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. [Sat Feb 27 19:50:22 2016] [debug] [DBIx debug] Took 0.78418899 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. [Sat Feb 27 19:50:27 2016] [debug] [DBIx debug] Took 4.06245685 seconds executed: SELECT me.job_id, me.result, me.soft_failure, COUNT( id ) FROM job_modules me WHERE ( job_id IN (…). [Sat Feb 27 19:50:27 2016] [debug] [DBIx debug] Took 0.76173782 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.00085783 seconds executed: SELECT me.id, me.job_id, me.group_id, me.text, me.user_id, me.hidden, me.t_created, me.t_updated FROM comments me WHERE ( job_id IN (…). [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.59689903 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219522', 'failed'. [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.00040007 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219522'. [Sat Feb 27 19:50:28 2016] [debug] reading /var/lib/openqa/testresults/00219522-sle-12-SP2-Server-DVD-s390x-Build1186-gnome/details-installation_overview.json [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.60596800 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219521', 'failed'. [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.00020695 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219521'. [Sat Feb 27 19:50:29 2016] [debug] reading /var/lib/openqa/testresults/00219521-sle-12-SP2-Server-DVD-s390x-Build1186-minimal+base/details-installation_overview_before.json [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.63927102 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219518', 'failed'. [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.00028086 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219518'. [Sat Feb 27 19:50:29 2016] [debug] reading /var/lib/openqa/testresults/00219518-sle-12-SP2-Server-DVD-s390x-Build1186-textmode/details-installer_timezone.json [Sat Feb 27 19:5 ``` So what I see is many entries like ``` Took 0.63927102 seconds executed: SELECT me.id, … FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219518', 'failed'. ``` ## steps to reproduce * load database dump 2016-02-26 from o.s.d into local PostgreSQL database (e.g. `sudo -u geekotest createdb openqa_suse_de_2016-02-26 && sudo -u geekotest pg_restore -d openqa_suse_de_2016-02-26 --role=geekotest ~/local/os-autoinst/openQA/local/SQL-DUMPS/openqa.suse.de/2016-02-26.dump`) * configure local openQA to load this database, e.g. ``` $ mkdir -p local && echo > local/stage_openqa_suse_de_2016-02-26/database.ini << EOF > [stage] > dsn = dbi:Pg:dbname=openqa > user = geekotest > EOF ``` * load openQA with this database and SQL queries, e.g. ``` $ time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_2016-02-26 OPENQA_DATABASE=stage script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1201&groupid=25' ``` * observe super long loading time and slow queries ## problem The overall processing time is way to long, main waiting time from queries. My (okurz) local tests yield ``` real 0m59.996s user 0m1.472s sys 0m0.056s ``` to render the request as shown in "steps to reproduce" H1. *REJECTED* - A machine specific problem (DONE: REJECTED, could be reproduced locally) H2. Recent openQA changes introduced a performance regression H3. The database got weird because of recent openQA changes H4. The database got weird due to other effects ## suggestion E1-1. DONE: Try to reproduce locally -> could be reproduced E2-1. Crosscheck with older version, e.g. the one used before last upgrade on o.s.d, if confirmed, git bisect to find culprit E3-1. Run an older database dump with recent openQA E4-1. @waitfor E2-1+E3-1, if both fail, accept H4 TODO