Project

General

Profile

Actions

action #10960

closed

current performance problems on o.s.d

Added by okurz about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
Start date:
2016-02-28
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

observation

Starting with early Friday, 2016-02-26, a performance regression on o.s.d was notified. I could reproduce what looks like the issue at hand locally. Enabling SQL query debugging I see quers for "job_modules" and "job_settings" taking long

steps to reproduce

  • load database dump 2016-02-26 from o.s.d into local PostgreSQL database (e.g. sudo -u geekotest createdb openqa_suse_de_2016-02-26 && sudo -u geekotest pg_restore -d openqa_suse_de_2016-02-26 --role=geekotest ~/local/os-autoinst/openQA/local/SQL-DUMPS/openqa.suse.de/2016-02-26.dump)
  • configure local openQA to load this database, e.g.
$ mkdir -p local && cat - > local/stage_openqa_suse_de_2016-02-26/database.ini << EOF
[stage]
dsn = dbi:Pg:dbname=openqa_suse_de_2016-02-26
user = geekotest
EOF
  • load openQA with this database and SQL queries, e.g.
$ time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_2016-02-26 OPENQA_DATABASE=stage script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1201&groupid=25'
  • observe super long loading time and slow queries

problem

The overall processing time is way to long, main waiting time from queries. My (okurz) local tests yield 15s for loading the index page with database dump from 2016-02-26 whereas it was around 2s for 2016-02-25.

H1: REJECTED - A machine specific problem (DONE: REJECTED, see E1-1, could be reproduced locally)
H2: REJECTED - Recent openQA changes introduced a performance regression (DONE: REJECTED, see #10960#note-4 E2-1)
H3: ACCEPTED - The database got weird because of recent openQA changes
H3.1: REJECTED - As more comments are used now because of okurz's changes the parsing gets slow (DONE: REJECTED, see #10960#note-6)
H3.2: ACCEPTED - Other changes cause the slowdown (DONE: ACCEPTED, see #10960#note-6)
H3.2.1: ACCEPTED - Something caused many more jobs considered for job settings queries to appear recently (DONE: ACCEPTED, see #10960#note-7)
H3.2.2: REJECTED - postgreSQL decides on its own that it should consider more jobs in job settings queries trying to do right but failing (DONE: REJECTED, see #10960#note-7)
H4: REJECTED - The database got weird due to other effects (DONE: REJECTED, see E4-1)

suggestion

E1-1: DONE: Try to reproduce locally -> could be reproduced
E2-1: DONE: Crosscheck with older version, e.g. the one used before last upgrade on o.s.d, if confirmed, git bisect to find culprit -> old version (97b8d9238aa918493883199e4da88eef3e578797) does not show an improvement in performance -> database f*ed
E3-1: DONE: Run an older database dump with recent openQA -> older database still fine (see #10960#note-4)
E3.2.1-1: DONE: ask others, don't know what could have caused this -> see #10960#note-7
E3.2.2-1: DONE: ask @coolo as he mentioned something like this recently -> see #10960#note-7
E4-1: DONE: @waitfor E2-1+E3-1, if both fail, accept H4 -> E2-1 FAIL, E3-1 SUCCESS


Subtasks 3 (0 open3 closed)

action #10966: SLES Build1204 overview page times out with "502 Bad Gateway"Resolvedokurz2016-02-28

Actions
action #10976: comments on overview page are not prefetchedResolvedokurz2016-02-28

Actions
action #10998: Scheduling improvementsResolvedcoolo2016-02-29

Actions
Actions #1

Updated by okurz about 8 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
Actions #2

Updated by okurz about 8 years ago

  • Description updated (diff)
Actions #3

Updated by okurz about 8 years ago

  • Description updated (diff)
Actions #4

Updated by okurz about 8 years ago

  • Description updated (diff)

E2-1 conducted.

Experiment

command executed, example for database "2016-02-21", commit "last good":

export date=2016-02-21; export commit=97b8d9238aa918493883199e4da88eef3e578797; time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_${date} OPENQA_DATABASE=stage script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1171&groupid=25' > local/poo#10960_performance_osd_dump-${date}_with_${commit}_b1171.log 2>&1 ; time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_${date} OPENQA_DATABASE=stage script/openqa get '/' > local/poo#10960_performance_osd_dump-${date}_with_${commit}_index.log 2>&1 ; time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_${date} OPENQA_DATABASE=stage script/openqa get '/group_overview/25' > local/poo#10960_performance_osd_dump-${date}_with_${commit}_overview_25.log 2>&1

"good" on 97b8d9238aa918493883199e4da88eef3e578797 (version installed before AFAIK), "bad" on e55e129fc7a4fd27fd3649f753314064ac8b5128 (origin/master)

Observations

logfiles: https://w3.suse.de/~okurz/openQA_poo%2310960_performance_regression/

setting times (s) for 'overview query', 'index', 'group_overview'
2016-02-26@current 2.3, 15.0, 8.5
2016-02-26@97b8d9238aa918493883199e4da88eef3e578797 2.3, 14.0, 7.8
2016-02-21@current 2.7, 2.5, 2.5
2016-02-21@97b8d9238aa918493883199e4da88eef3e578797 2.3, 1.6, 1.7

Conclusions

C2-1-1: Going back to an older version of openQA software does not show an improvement in performance. The only relevant difference is that the current openQA version renders the "helper templates" and shows it in the logs but with no performance impact.
C2-1-2a: The database from 2016-02-21 was still ok, 2016-02-26 shows a significant degradation

another run with database "2016-02-25":

setting times (s) for 'overview query', 'index', 'group_overview'
2016-02-26@current 2.3, 15.0, 8.5
2016-02-25@current 2.5, 2.2, 2.7

C2-1-2b: The database from 2016-02-25 was still ok, 2016-02-26 shows a significant degradation

Actions #5

Updated by okurz about 8 years ago

  • Description updated (diff)
Actions #6

Updated by okurz about 8 years ago

  • Description updated (diff)

We are parsing the job settings now more often and I found queries with a super long job list when loading the index page, see https://w3.suse.de/~okurz/openQA_poo%2310960_performance_regression/poo%2310960_performance_osd_dump-2016-02-26_with_current_index-2016-02-27-22-54-52_only_long_timed_queries.txt
see first line
or
https://w3.suse.de/~okurz/openQA_poo%2310960_performance_regression/poo%2310960_performance_osd_dump-2016-02-26_with_current_index-2016-02-27-22-54-52.log for the raw log.

Also https://w3.suse.de/~okurz/openQA_poo%2310960_performance_regression/poo%2310960_performance_osd_dump-2016-02-26_with_97b8d9238aa918493883199e4da88eef3e578797_index.log for a run without my changes.

It looks like there are way to many jobs considered for querying the job settings which is actually appearing also when testing against the old installed openQA version. So ACCEPTing H3.2, REJECTing H3.1. But the question is rather: Why are so many jobs considered for the job_settings query?

Actions #7

Updated by okurz about 8 years ago

okurz wrote:

Why are so many jobs considered for the job_settings query?

It's slenkins …

I checked the query and the first and last job id both belong to slenkins so I guess this is the problem. slenkins builds are so many that the corresponding job settings query takes very long, I mean very very long.

Actions #8

Updated by okurz about 8 years ago

  • Assignee set to RBrownSUSE
Actions #9

Updated by okurz about 8 years ago

PR ready openQA gh#582

Combining openQA gh#582 and openQA gh#583
for url in '/tests/overview?distri=sle&version=12-SP2&build=1171&groupid=25' '/' '/group_overview/25' ; do export date=2016-02-26; time sudo -u geekotest OPENQA_CONFIG=local/stage_openqa_suse_de_${date} OPENQA_DATABASE=stage script/openqa get $url > /dev/null 2>&1 ; done 2>&1 | grep real yields

state times (s) for 'overview query', 'index', 'group_overview'
before 2.3, 15.4, 9.1
after 1.3, 2.5, 8.7
+openQA gh#581 1.4, 2.5, 12.7

So openQA gh#582 and openQA gh#583 are a significant improvement, openQA gh#581 so far is not. That's why I marked the latter as WIP.

Actions #10

Updated by okurz about 8 years ago

  • Description updated (diff)
Actions #11

Updated by okurz about 8 years ago

  • Assignee changed from RBrownSUSE to coolo

[29/02/2016 12:50:26] rbrown, okurz: I removed ~50.000 jobs of 1204 - and with some luck I didn't delete anything else ;)

Needs subtasks to complete

Actions #12

Updated by coolo about 8 years ago

in psql:

select * from jobs where id in (select s1.job_id from job_settings s1 left join job_settings s2 on s1.job_id=s2.job_id where s1.key='BUILD' and s1.value='1204' and s2.key='VERSION' and s2.value='12-SP2') and result='obsoleted' order by id \g '/tmp/jobs'

in shell:

for i in $(cat /tmp/jobs); do echo "delete from job_module_needles where job_module_id in (select id from job_modules where job_id=$i); delete from job_modules where job_id=$i; delete from comments where job_id=$i; delete from job_dependencies where parent_job_id=$i; delete from job_dependencies where child_job_id=$i; delete from job_networks where job_id=$i; delete from jobs_assets where job_id=$i; delete from job_settings where job_id=$i; delete from jobs where id=$i;"; done | psql openqa
Actions #13

Updated by coolo about 8 years ago

  • Status changed from In Progress to Resolved

we cleaned the database and improved the scheduling - and the logging of it

Actions

Also available in: Atom PDF