action #10960

closed

current performance problems on o.s.d

Added by okurz about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
Start date: 2016-02-28
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

observation

Starting early on Friday, 2016-02-26, a performance regression on o.s.d was reported. I could reproduce what looks like the issue at hand locally. With SQL query debugging enabled I see queries for "job_modules" and "job_settings" taking a long time.

steps to reproduce

  • load database dump 2016-02-26 from o.s.d into local PostgreSQL database (e.g. sudo -u geekotest createdb openqa_suse_de_2016-02-26 && sudo -u geekotest pg_restore -d openqa_suse_de_2016-02-26 --role=geekotest ~/local/os-autoinst/openQA/local/SQL-DUMPS/openqa.suse.de/2016-02-26.dump)
  • configure local openQA to load this database, e.g.
$ mkdir -p local/stage_openqa_suse_de_2016-02-26 && cat - > local/stage_openqa_suse_de_2016-02-26/database.ini << EOF
[stage]
dsn = dbi:Pg:dbname=openqa_suse_de_2016-02-26
user = geekotest
EOF
  • load openQA with this database and SQL query debugging enabled, e.g.
$ time sudo -u geekotest OPENQA_SQL_DEBUG=1 OPENQA_CONFIG=local/stage_openqa_suse_de_2016-02-26 OPENQA_DATABASE=stage script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1201&groupid=25'
  • observe the very long loading time and the slow queries
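
The steps above can be wrapped in a small shell helper so that the same request can be timed against more than one dump. This is only a sketch: the function names `config_dir_for`, `write_database_ini` and `time_overview` are made up for illustration and are not part of openQA, and it assumes the dump for each date has already been restored into a database named `openqa_suse_de_<date>` as in the first step.

```shell
# Sketch only: helper names are hypothetical, not part of openQA.

# Print the config directory used for the dump of a given date (YYYY-MM-DD).
config_dir_for() {
    printf 'local/stage_openqa_suse_de_%s\n' "$1"
}

# Create the config directory and write a database.ini pointing at the
# restored dump for the given date.
write_database_ini() {
    dir=$(config_dir_for "$1")
    mkdir -p "$dir" || return 1
    cat > "$dir/database.ini" << EOF
[stage]
dsn = dbi:Pg:dbname=openqa_suse_de_$1
user = geekotest
EOF
}

# Time the overview request from the steps above against the dump of a
# given date. Must be run from the openQA checkout, like the original command.
time_overview() {
    write_database_ini "$1" || return 1
    time sudo -u geekotest OPENQA_SQL_DEBUG=1 \
        OPENQA_CONFIG="$(config_dir_for "$1")" OPENQA_DATABASE=stage \
        script/openqa get '/tests/overview?distri=sle&version=12-SP2&build=1201&groupid=25'
}
```

Calling `time_overview 2016-02-25` followed by `time_overview 2016-02-26` would then give the two timings to compare.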

problem

The overall processing time is way too long, with most of the waiting time spent in queries. My (okurz) local tests yield 15s for loading the index page with the database dump from 2016-02-26, whereas it was around 2s for 2016-02-25.

H1: REJECTED - A machine specific problem (DONE: REJECTED, see E1-1, could be reproduced locally)
H2: REJECTED - Recent openQA changes introduced a performance regression (DONE: REJECTED, see #10960#note-4 E2-1)
H3: ACCEPTED - The database got weird because of recent openQA changes
H3.1: REJECTED - As more comments are used now because of okurz's changes the parsing gets slow (DONE: REJECTED, see #10960#note-6)
H3.2: ACCEPTED - Other changes cause the slowdown (DONE: ACCEPTED, see #10960#note-6)
H3.2.1: ACCEPTED - Something caused many more jobs considered for job settings queries to appear recently (DONE: ACCEPTED, see #10960#note-7)
H3.2.2: REJECTED - PostgreSQL decides on its own that it should consider more jobs in job settings queries, trying to do the right thing but failing (DONE: REJECTED, see #10960#note-7)
H4: REJECTED - The database got weird due to other effects (DONE: REJECTED, see E4-1)
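
A rough first check for a change in data volume between two dumps is to compare row counts of the tables behind the slow queries. The following is a sketch under assumptions: it takes the table names `job_modules` and `job_settings` from the query names in the observation, reuses the database naming from the steps to reproduce, and the helper names `row_count_sql` and `row_counts_for` are hypothetical. Row counts alone do not show how many jobs a given query considers, so this can only hint at H3.2.1.

```shell
# Sketch only: helper names are hypothetical, table names are assumed
# from the slow queries mentioned in the observation.

# Print a SQL statement that counts the rows in both tables.
row_count_sql() {
    printf "SELECT 'job_modules' AS tbl, count(*) FROM job_modules UNION ALL SELECT 'job_settings', count(*) FROM job_settings;\n"
}

# Print the row counts for the restored dump of a given date (YYYY-MM-DD),
# one "table|count" line per table (-A: unaligned, -t: tuples only).
row_counts_for() {
    sudo -u geekotest psql -At -d "openqa_suse_de_$1" -c "$(row_count_sql)"
}
```

Running `row_counts_for 2016-02-25` and `row_counts_for 2016-02-26` would print the counts for both dumps for a side-by-side comparison.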

suggestion

E1-1: DONE: Try to reproduce locally -> could be reproduced
E2-1: DONE: Crosscheck with older version, e.g. the one used before last upgrade on o.s.d, if confirmed, git bisect to find culprit -> old version (97b8d9238aa918493883199e4da88eef3e578797) does not show an improvement in performance -> database f*ed
E3-1: DONE: Run an older database dump with recent openQA -> older database still fine (see #10960#note-4)
E3.2.1-1: DONE: ask others, don't know what could have caused this -> see #10960#note-7
E3.2.2-1: DONE: ask @coolo as he mentioned something like this recently -> see #10960#note-7
E4-1: DONE: @waitfor E2-1+E3-1, if both fail, accept H4 -> E2-1 FAIL, E3-1 SUCCESS


Subtasks 3 (0 open, 3 closed)

action #10966: SLES Build1204 overview page times out with "502 Bad Gateway" (Resolved, okurz, 2016-02-28)
action #10976: comments on overview page are not prefetched (Resolved, okurz, 2016-02-28)
action #10998: Scheduling improvements (Resolved, coolo, 2016-02-29)
