action #117655
closed
QA - coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
QA - coordination #117694: [epic] Stable and reliable qem-bot
Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M
Description
Motivation
Issues like #117619 made us think about the dashboard, and we can see it as just a sort of "cache" for openQA results. We think we can provide all necessary information from openQA directly and avoid the synchronization/translation problems. The main issue is to provide a nice API to "map" incident numbers to a set of openQA job results.
Acceptance criteria
- AC1: API endpoint exists to programmatically get a list of all jobs for an incident
Suggestions
- Create a follow-up ticket for an incident search feature in the UI to get all job results for an incident (making it possible to link to the results from an external service).
Updated by mkittler about 2 years ago
On the openQA side the incident is just a job setting (e.g. INCIDENT_ID=…). So this boils down to querying jobs by job settings.
Querying jobs by settings is easy to do via SQL (e.g. select job_id, value, (select result from jobs where id = job_id) from job_settings where key = 'UEFI_PFLASH_VARS' and value like '%ovmf%' order by job_id desc limit 50;), so implementing a route that does that should be easy.
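As a rough sketch of that mapping (sqlite standing in for PostgreSQL; table and column names follow the queries in this ticket, the data rows are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the openQA database; schema mirrors the
# job_settings/jobs queries shown in this ticket, values are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT);
    CREATE TABLE job_settings (job_id INTEGER, key TEXT, value TEXT);
    INSERT INTO jobs VALUES (1, 'passed'), (2, 'failed'), (3, 'passed');
    INSERT INTO job_settings VALUES
        (1, 'INCIDENT_ID', '26023'),
        (2, 'INCIDENT_ID', '26023'),
        (3, 'INCIDENT_ID', '99999');
""")

def jobs_for_incident(incident_id):
    """Map an incident number to openQA job results via job settings."""
    rows = db.execute(
        "SELECT job_id, (SELECT result FROM jobs WHERE id = job_id)"
        " FROM job_settings WHERE key = 'INCIDENT_ID' AND value = ?"
        " ORDER BY job_id",
        (incident_id,),
    )
    return list(rows)

print(jobs_for_incident('26023'))  # [(1, 'passed'), (2, 'failed')]
```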
Updated by okurz about 2 years ago
- Related to action #117619: Bot approved update request with failing tests size:M added
Updated by livdywan about 2 years ago
For example https://openqa.suse.de/tests/9642631#settings
_TEST_ISSUES or _TEST_REPOS could be queried, either efficiently by number lookup or fulltext search respectively - in the former case we'd need to ensure the variables aren't dropped from jobs in the future
Updated by tinita about 2 years ago
cdywan wrote:
For example https://openqa.suse.de/tests/9642631#settings
_TEST_ISSUES or _TEST_REPOS could be queried, either efficiently by number lookup or fulltext search respectively - in the former case we'd need to ensure the variables aren't dropped from jobs in the future
I think both variables are a fulltext search.
BASE_TEST_ISSUES=23246,24207,24462,...
BASE_TEST_REPOS=http://download.suse.de/ibs/SUSE:/Maintenance:/23246/SUSE_Updates_SLE-Module-Basesystem_15-SP4_x86_64/,...
I can't see where we would be able to search by a number.
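A small illustration (hypothetical values) of why the fulltext approach needs care: a LIKE '%26023%' substring match also hits unrelated incident numbers that merely contain the digits, whereas splitting the comma-separated list allows an exact match per element:

```python
# Made-up *_TEST_ISSUES values; the second contains 126023, not 26023.
values = [
    "23246,26023,24462",  # contains incident 26023 -> should match
    "126023,24207",       # substring false positive for '26023'
    "26023",              # exact single value -> should match
]

incident = "26023"

# LIKE '%26023%' equivalent: plain substring test.
substring_hits = [v for v in values if incident in v]

# Exact match against one element of the comma-separated list.
exact_hits = [v for v in values if incident in v.split(",")]

print(substring_hits)  # all three values match
print(exact_hits)      # only the first and third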
Updated by kraih about 2 years ago
This feature would be very nice for investigating problems with the dashboard. I would even go so far as to make it usable from the UI too (in addition to a JSON API endpoint), then the dashboard could link to all jobs for a certain incident.
Updated by okurz about 2 years ago
- Target version changed from future to Ready
ok. Sounds good but I would start with JSON only, UI for something later.
Updated by kraih about 2 years ago
okurz wrote:
ok. Sounds good but I would start with JSON only, UI for something later.
Fair enough, i've added that as an AC. And the ui feature as a suggestion for a followup ticket.
Updated by openqa_review about 2 years ago
- Due date set to 2022-11-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih about 2 years ago
Performance is a little bit of an issue here, since we need to scan two fields for partial matches.
openqa=# explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..777812.25 rows=5321 width=97) (actual time=150.485..10369.291 rows=43729 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on job_settings (cost=0.00..776280.15 rows=2217 width=97) (actual time=162.084..10232.458 rows=14576 loops=3)
Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
Rows Removed by Filter: 11064132
Planning Time: 0.398 ms
JIT:
Functions: 6
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 2.027 ms, Inlining 227.961 ms, Optimization 113.843 ms, Emission 58.950 ms, Total 402.781 ms
Execution Time: 10400.237 ms
(12 rows)
We might have to look into using the pg_trgm extension, which would allow the use of a GIN index to speed up LIKE/ILIKE matching. I'll have to do some experiments to see how much that improves performance. On OSD we are dealing with 33236253 rows, and on O3 with 9758036 rows. That's quite a substantial dataset.
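For intuition on why pg_trgm helps here: a trigram index stores the three-character substrings of each value, so a LIKE '%26023%' search only needs to visit rows whose trigram sets cover the pattern's trigrams, followed by a recheck. A toy sketch of the idea (not the actual extension internals, which use an inverted GIN structure):

```python
def trigrams(s):
    """All 3-character substrings, roughly what pg_trgm indexes per value."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Toy "index": value -> its trigram set (a real GIN index inverts this,
# mapping trigram -> row ids). Values are made up for illustration.
values = ["23246,26023,24462", "24207,24999", "26023"]
index = {v: trigrams(v) for v in values}

pattern = "26023"
needed = trigrams(pattern)  # {'260', '602', '023'}

# Only values whose trigram sets cover the pattern's are candidates; a
# final recheck with the real substring test removes accidental matches.
candidates = [v for v, t in index.items() if needed <= t]
hits = [v for v in candidates if pattern in v]
print(hits)  # ['23246,26023,24462', '26023']
```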
Updated by kraih about 2 years ago
Not sure why searching for all known keys in an array was so much worse.
explain analyze select * from job_settings where key = any('{"SERVERAPP_TEST_ISSUES", "SES_TEST_ISSUES", "SLES4SAP_TEST_ISSUES", "ASMM_TEST_ISSUES", "OS_TEST_ISSUES", "LIVE_TEST_ISSUES", "HPCM_TEST_ISSUES", "LTSS_TEST_ISSUES", "DESKTOP_TEST_ISSUES","PYTHON2_TEST_ISSUES","SCRIPT_TEST_ISSUES","CONTM_TEST_ISSUES","RT_TEST_ISSUES","HA_TEST_ISSUES","WSM_TEST_ISSUES","PCM_TEST_ISSUES","WE_TEST_ISSUES","BASE_TEST_ISSUES","TCM_TEST_ISSUES","PYTHON3_TEST_ISSUES","SAP_TEST_ISSUES","SDK_TEST_ISSUES","LEGACY_TEST_ISSUES"}') and value like '%26023%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on job_settings (cost=8644.17..602899.54 rows=3322 width=97) (actual time=457.431..143094.311 rows=43731 loops=1)
Recheck Cond: (key = ANY ('{SERVERAPP_TEST_ISSUES,SES_TEST_ISSUES,SLES4SAP_TEST_ISSUES,ASMM_TEST_ISSUES,OS_TEST_ISSUES,LIVE_TEST_ISSUES,HPCM_TEST_ISSUES,LTSS_TEST_ISSUES,DESKTOP_TEST_ISSUES,PYTHON2_TEST_ISSUES,SCRIPT_TEST_ISSUES,CONTM_TEST_ISSUES,RT_TEST_ISSUES,HA_TEST_ISSUES,WSM_TEST_ISSUES,PCM_TEST_ISSUES,WE_TEST_ISSUES,BASE_TEST_ISSUES,TCM_TEST_ISSUES,PYTHON3_TEST_ISSUES,SAP_TEST_ISSUES,SDK_TEST_ISSUES,LEGACY_TEST_ISSUES}'::text[]))
Filter: (value ~~ '%26023%'::text)
Rows Removed by Filter: 922847
Heap Blocks: exact=314261
-> Bitmap Index Scan on idx_value_settings (cost=0.00..8643.28 rows=542173 width=0) (actual time=259.079..259.081 rows=1005205 loops=1)
Index Cond: (key = ANY ('{SERVERAPP_TEST_ISSUES,SES_TEST_ISSUES,SLES4SAP_TEST_ISSUES,ASMM_TEST_ISSUES,OS_TEST_ISSUES,LIVE_TEST_ISSUES,HPCM_TEST_ISSUES,LTSS_TEST_ISSUES,DESKTOP_TEST_ISSUES,PYTHON2_TEST_ISSUES,SCRIPT_TEST_ISSUES,CONTM_TEST_ISSUES,RT_TEST_ISSUES,HA_TEST_ISSUES,WSM_TEST_ISSUES,PCM_TEST_ISSUES,WE_TEST_ISSUES,BASE_TEST_ISSUES,TCM_TEST_ISSUES,PYTHON3_TEST_ISSUES,SAP_TEST_ISSUES,SDK_TEST_ISSUES,LEGACY_TEST_ISSUES}'::text[]))
Planning Time: 0.409 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.801 ms, Inlining 9.158 ms, Optimization 24.012 ms, Emission 11.513 ms, Total 45.483 ms
Execution Time: 143139.210 ms
(13 rows)
Updated by kraih about 2 years ago
I have ways to improve performance, but they might have to be discussed because they require PostgreSQL extensions. For now i'll finish the (slow) implementation with unit tests.
Updated by okurz about 2 years ago
- Related to action #119746: [spike][timeboxed:20h] Filter openQA todo-jobs on /tests belonging to one "review squad" size:S added
Updated by okurz about 2 years ago
Discussed in weekly unblock 2022-11-02. pg_trgm might be an option but first we should try to limit the search base, e.g. only select N most recent jobs, e.g. 10k jobs, easily limitable by the job id in job_settings, not needing a join. Also possible to limit by time, e.g. only go back 90 days. But then we need to give the user the option to tweak that, e.g. increase from N jobs or from M days.
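The "limit by job id, no join needed" idea from the discussion could look roughly like this (sqlite stand-in with made-up rows; newest_n is the user-tweakable window of most recent job ids):

```python
import sqlite3

# Made-up job_settings rows; only the last two fall inside the window.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE job_settings (job_id INTEGER, key TEXT, value TEXT);
    INSERT INTO job_settings VALUES
        (100,  'BASE_TEST_ISSUES', '26023,24207'),
        (9000, 'BASE_TEST_ISSUES', '26023'),
        (9990, 'LTSS_TEST_ISSUES', '26023,23246');
""")

def search_settings(pattern, newest_n=1000):
    """Scan only the newest_n most recent job ids; no join with jobs needed."""
    (max_id,) = db.execute("SELECT max(job_id) FROM job_settings").fetchone()
    rows = db.execute(
        "SELECT job_id, key, value FROM job_settings"
        " WHERE key LIKE '%TEST_ISSUES' AND value LIKE ? AND job_id > ?",
        (f"%{pattern}%", max_id - newest_n),
    )
    return list(rows)

# Job 100 is outside the 1000-id window and is skipped entirely.
print(search_settings("26023"))
```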
Updated by kraih about 2 years ago
This is the feature so far (https://github.com/os-autoinst/openQA/compare/k/incident_search), now i just have to improve performance a bit.
Updated by mkittler about 2 years ago
- Subject changed from Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does to Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M
Updated by kraih about 2 years ago
Some first benchmarking results with limits for t_created and based on job_id.
Even with an index on t_created, limiting the results to 90 days does pretty much nothing:
explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and t_created > NOW() - interval '90' day;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..881045.84 rows=46 width=95) (actual time=247.489..16445.371 rows=45874 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on job_settings (cost=0.00..880041.24 rows=19 width=95) (actual time=342.451..15959.041 rows=15291 loops=3)
Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text) AND (t_created > (now() - '90 days'::interval day)))
Rows Removed by Filter: 11006193
Planning Time: 0.255 ms
JIT:
Functions: 6
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 4.581 ms, Inlining 528.594 ms, Optimization 305.590 ms, Emission 169.980 ms, Total 1008.745 ms
Execution Time: 16477.522 ms
(12 rows)
Only when you limit it to a value as low as 5 days does it bother to use the index:
explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and t_created > NOW() - interval '5' day;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=12553.51..602959.31 rows=2 width=95) (actual time=262.828..857.520 rows=4694 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on job_settings (cost=11553.51..601959.11 rows=1 width=95) (actual time=434.889..619.179 rows=1565 loops=3)
Recheck Cond: (t_created > (now() - '5 days'::interval day))
Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
Rows Removed by Filter: 463509
Heap Blocks: exact=26091
-> Bitmap Index Scan on job_settings_t_created_idx (cost=0.00..11553.51 rows=1051342 width=0) (actual time=249.785..249.787 rows=1395222 loops=1)
Index Cond: (t_created > (now() - '5 days'::interval day))
Planning Time: 0.405 ms
JIT:
Functions: 15
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.654 ms, Inlining 524.190 ms, Optimization 333.531 ms, Emission 132.677 ms, Total 996.053 ms
Execution Time: 861.819 ms
(16 rows)
The job_id limit performs quite well though:
openqa=# drop index job_settings_t_created_idx;
DROP INDEX
openqa=# select max(job_id) from job_settings;
max
---------
9866294
(1 row)
openqa=# explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and job_id > 9856294;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on job_settings (cost=2715.53..432973.63 rows=1 width=95) (actual time=50.623..291.816 rows=99 loops=1)
Recheck Cond: (job_id > 9856294)
Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
Rows Removed by Filter: 470135
Heap Blocks: exact=9986
-> Bitmap Index Scan on job_settings_idx_job_id (cost=0.00..2715.52 rows=223345 width=0) (actual time=24.384..24.386 rows=470547 loops=1)
Index Cond: (job_id > 9856294)
Planning Time: 0.585 ms
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 4.009 ms, Inlining 0.000 ms, Optimization 6.386 ms, Emission 8.188 ms, Total 18.583 ms
Execution Time: 296.938 ms
(13 rows)
Even if we were to go back 50k job ids it seems ok-ish:
explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and job_id > 9816294;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=29053.83..615952.68 rows=5 width=95) (actual time=1300.236..1998.500 rows=7709 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on job_settings (cost=28053.83..614952.18 rows=2 width=95) (actual time=1289.221..1832.214 rows=2570 loops=3)
Recheck Cond: (job_id > 9816294)
Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
Rows Removed by Filter: 843520
Heap Blocks: exact=30864
-> Bitmap Index Scan on job_settings_idx_job_id (cost=0.00..28053.83 rows=2308985 width=0) (actual time=1185.370..1185.371 rows=2541283 loops=1)
Index Cond: (job_id > 9816294)
Planning Time: 0.266 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 3.517 ms, Inlining 248.875 ms, Optimization 119.009 ms, Emission 80.621 ms, Total 452.021 ms
Execution Time: 2003.399 ms
(16 rows)
Updated by kraih about 2 years ago
And next up pg_trgm. This thing really rocks.
openqa-osd=> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
openqa-osd=> CREATE INDEX ON job_settings USING gin (key gin_trgm_ops);
CREATE INDEX
openqa-osd=> CREATE INDEX ON job_settings USING gin (value gin_trgm_ops);
CREATE INDEX
It seems to know that it only really needs the value index, and it makes a huge difference:
openqa-osd=> explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on job_settings (cost=263.11..7959.94 rows=58 width=96) (actual time=47.152..372.824 rows=45874 loops=1)
Recheck Cond: (value ~~ '%26023%'::text)
Rows Removed by Index Recheck: 3
Filter: (key ~~ '%TEST_ISSUES'::text)
Rows Removed by Filter: 46854
Heap Blocks: exact=51833
-> Bitmap Index Scan on job_settings_value_idx (cost=0.00..263.10 rows=2013 width=0) (actual time=35.969..35.970 rows=92731 loops=1)
Index Cond: (value ~~ '%26023%'::text)
Planning Time: 1.216 ms
Execution Time: 374.051 ms
(10 rows)
On my development VM creating the index takes about 4 minutes with a very fast SSD. And they do take up quite a bit of space:
public | job_settings_key_idx | index | sri | job_settings | permanent | gin | 679 MB |
public | job_settings_value_idx | index | sri | job_settings | permanent | gin | 1534 MB |
Updated by kraih about 2 years ago
And for completeness' sake, a compound trigram index on both fields:
openqa-osd=> CREATE INDEX ON job_settings USING gin (key gin_trgm_ops, value gin_trgm_ops);
CREATE INDEX
And it's pretty fast:
openqa-osd=> explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on job_settings (cost=952.59..1183.63 rows=58 width=96) (actual time=126.601..279.679 rows=45874 loops=1)
Recheck Cond: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
Heap Blocks: exact=37127
-> Bitmap Index Scan on job_settings_key_value_idx (cost=0.00..952.58 rows=58 width=0) (actual time=122.977..122.977 rows=45874 loops=1)
Index Cond: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
Planning Time: 1.018 ms
Execution Time: 280.836 ms
(7 rows)
But also large (took about 6 minutes to create the index):
public | job_settings_key_value_idx | index | sri | job_settings | permanent | gin | 2211 MB |
Updated by kraih about 2 years ago
PR opened with the job id limit solution: https://github.com/os-autoinst/openQA/pull/4888
Updated by kraih about 2 years ago
Looks like this ticket has to go back to the drawing board, since there have been too many votes against an incident specific API.
Updated by kraih about 2 years ago
- Due date changed from 2022-11-08 to 2022-11-10
Updated by kraih about 2 years ago
Updated PR with a more generic solution: https://github.com/os-autoinst/openQA/pull/4888
Updated by kraih about 2 years ago
- Related to action #120115: [qem-dashboard] Use new openQA job settings API to show more data added
Updated by kraih about 2 years ago
- Related to action #120118: Provide job settings filter options added
Updated by kraih about 2 years ago
Updated by okurz about 2 years ago
- Due date deleted (2022-11-10)
- Status changed from Feedback to Resolved
Examples:
openqa-cli api --o3 /job_settings/jobs key=*_TEST_ISSUES list_value=1234567
openqa-cli api --osd /job_settings/jobs key=LTSS_TEST_ISSUES list_value=20988
We tried that out and it worked, returning a list of jobs. For now I would call this sufficient. We can think about uses of that or extensions in the future.
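The two openqa-cli examples above combine a glob on the setting key with what is presumably an exact match against one element of the comma-separated value (the list_value parameter). That assumed behaviour can be mimicked like this (settings rows are made up):

```python
import fnmatch

# Made-up settings rows: (job_id, key, value). key=*_TEST_ISSUES is a
# glob; list_value is assumed to match a full comma-separated element.
settings = [
    (9001, "BASE_TEST_ISSUES", "23246,1234567"),
    (9002, "LTSS_TEST_ISSUES", "20988"),
    (9003, "BUILD", "1234567"),
]

def query(key_glob, list_value):
    """Return job ids whose setting key matches the glob and whose value
    list contains list_value as an exact element."""
    return [
        job_id for job_id, key, value in settings
        if fnmatch.fnmatch(key, key_glob) and list_value in value.split(",")
    ]

print(query("*_TEST_ISSUES", "1234567"))   # [9001]
print(query("LTSS_TEST_ISSUES", "20988"))  # [9002]
```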