Project

General

Profile

action #10960

Updated by okurz about 8 years ago

## observation 
 Starting with early Friday, 2016-02-26, a performance regression on o.s.d was notified. I could reproduce what looks like the issue at hand locally. Enabling SQL query debugging this is how the logfile looks like when accessing e.g. `/group_overview/25` 

 ``` 
 [Sat Feb 27 19:50:18 2016] [debug] Routing to controller "OpenQA::WebAPI::Controller::Test" and action "overview" 
 [Sat Feb 27 19:50:18 2016] [debug] [DBIx debug] Took 0.00022697 seconds executed: SELECT me.id, me.name, me.size_limit_gb, me.keep_logs_in_days, me.t_created, me.t_updated FROM job_groups me WHERE ( me.id = ? ): '25'. 
 [Sat Feb 27 19:50:21 2016] [debug] [DBIx debug] Took 2.80118012 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. 
 [Sat Feb 27 19:50:22 2016] [debug] [DBIx debug] Took 0.78418899 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. 
 [Sat Feb 27 19:50:27 2016] [debug] [DBIx debug] Took 4.06245685 seconds executed: SELECT me.job_id, me.result, me.soft_failure, COUNT( id ) FROM job_modules me WHERE ( job_id IN (…). 
 [Sat Feb 27 19:50:27 2016] [debug] [DBIx debug] Took 0.76173782 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated, settings.id, settings.key, settings.value, settings.job_id, settings.t_created, settings.t_updated, parents.child_job_id, parents.parent_job_id, parents.dependency, children.child_job_id, children.parent_job_id, children.dependency FROM jobs me LEFT JOIN job_settings settings ON settings.job_id = me.id LEFT JOIN job_dependencies parents ON parents.child_job_id = me.id LEFT JOIN job_dependencies children ON children.parent_job_id = me.id WHERE ( ( me.clone_id IS NULL AND me.group_id = ? AND me.id IN ( SELECT me.job_id FROM job_settings me LEFT JOIN job_settings siblings ON siblings.job_id = me.job_id LEFT JOIN job_settings siblings_2 ON siblings_2.job_id = me.job_id WHERE ( ( ( me.key = ? AND me.value = ? ) AND ( siblings.key = ? AND siblings.value = ? ) AND ( siblings_2.key = ? AND siblings_2.value = ? ) ) ) ) ) ) ORDER BY me.id DESC: '25', 'DISTRI', 'sle', 'VERSION', '12-SP2', 'BUILD', '1186'. 
 [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.00085783 seconds executed: SELECT me.id, me.job_id, me.group_id, me.text, me.user_id, me.hidden, me.t_created, me.t_updated FROM comments me WHERE ( job_id IN (…). 
 [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.59689903 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219522', 'failed'. 
 [Sat Feb 27 19:50:28 2016] [debug] [DBIx debug] Took 0.00040007 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219522'. 
 [Sat Feb 27 19:50:28 2016] [debug] reading /var/lib/openqa/testresults/00219522-sle-12-SP2-Server-DVD-s390x-Build1186-gnome/details-installation_overview.json 
 [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.60596800 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219521', 'failed'. 
 [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.00020695 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219521'. 
 [Sat Feb 27 19:50:29 2016] [debug] reading /var/lib/openqa/testresults/00219521-sle-12-SP2-Server-DVD-s390x-Build1186-minimal+base/details-installation_overview_before.json 
 [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.63927102 seconds executed: SELECT me.id, me.job_id, me.name, me.script, me.category, me.soft_failure, me.milestone, me.important, me.fatal, me.result, me.t_created, me.t_updated FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219518', 'failed'. 
 [Sat Feb 27 19:50:29 2016] [debug] [DBIx debug] Took 0.00028086 seconds executed: SELECT me.id, me.slug, me.result_dir, me.state, me.priority, me.result, me.worker_id, me.test, me.clone_id, me.retry_avbl, me.backend, me.backend_info, me.group_id, me.t_started, me.t_finished, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '219518'. 
 [Sat Feb 27 19:50:29 2016] [debug] reading /var/lib/openqa/testresults/00219518-sle-12-SP2-Server-DVD-s390x-Build1186-textmode/details-installer_timezone.json 
 [Sat Feb 27 19:5 
 ``` 

 So what I see is many entries like 
 ``` 
 Took 0.63927102 seconds executed: SELECT me.id, … FROM job_modules me WHERE ( ( me.job_id = ? AND result = ? ) ): '219518', 'failed'. 
 ``` 

 ## steps to reproduce 
 * load database dump 2016-02-26 from o.s.d into local PostgreSQL database (e.g. `sudo -u geekotest createdb openqa_suse_de_2016-02-26 && sudo -u geekotest pg_restore -d openqa_suse_de_2016-02-26 --role=geekotest ~/local/os-autoinst/openQA/local/SQL-DUMPS/openqa.suse.de/2016-02-26.dump`) 
 * configure local openQA to load this database, e.g. 

 ``` 
 $ mkdir -p local && cat - echo > local/stage_openqa_suse_de_2016-02-26/database.ini << EOF 
 > [stage] 
 > dsn = dbi:Pg:dbname=openqa 
 > user = geekotest 
 > EOF 
 ``` 

 * load openQA with this database and SQL queries, e.g. 

 ``` 
 $ time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_2016-02-26 OPENQA_DATABASE=stage script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1201&groupid=25' 
 ``` 
 * observe super long loading time and slow queries 


 ## problem 
 The overall processing time is way to long, main waiting time from queries. My (okurz) local tests yield 

 ``` 
 real 	 0m59.996s 
 user 	 0m1.472s 
 sys 	 0m0.056s 
 ``` 
 to render the request as shown in "steps to reproduce" 

 H1. *REJECTED* - A machine specific problem (DONE: REJECTED, could be reproduced locally) 
 H2. Recent openQA changes introduced a performance regression 
 H3. The database got weird because of recent openQA changes 
 H4. The database got weird due to other effects 

 ## suggestion 

 E1-1. DONE: Try to reproduce locally -> could be reproduced 
 E2-1. Crosscheck with older version, e.g. the one used before last upgrade on o.s.d, if confirmed, git bisect to find culprit 
 E3-1. Run an older database dump with recent openQA 
 E4-1. @waitfor E2-1+E3-1, if both fail, accept H4

Back