Project

General

Profile

action #117655

QA - coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

QA - coordination #117694: [epic] Stable and reliable qem-bot

Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M

Added by jbaier_cz about 2 months ago. Updated 23 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2022-10-06
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

Issues like #117619 made us think about the dashboard and we can see it as just a some sort of a "cache" for openQA results. We think we can provide all necessary information from the openQA directly and avoid the synchronization/translation problems. The main issue is to provide some nice API to "map" incident numbers to a set of openQA job results.

Acceptance criteria

  • AC1: API endpoint exists to programmatically get a list of all jobs for an incident

Sugestions

  • Make a followup ticket for an incident search feature exists in the UI to get all job results for an incident (and it is possible to link to the results from an external service).

Related issues

Related to QA - action #117619: Bot approved update request with failing tests size:MResolved

Related to openQA Project - action #119746: [spike][timeboxed:20h] Filter openQA todo-jobs on /tests belonging to one "review squad"New2022-11-02

Related to openQA Infrastructure - action #120115: [qem-dashboard] Use new openQA job settings API to show more dataNew2022-11-08

Related to openQA Project - action #120118: Provide job settings filter optionsNew2022-11-08

History

#1 Updated by mkittler about 2 months ago

On the openQA-side the incident is just a job setting (e.g. INCIDENT_ID=…). So this boils down to querying jobs by job settings.

Querying jobs by settings is easy to do via SQL (e.g. select job_id, value, (select result from jobs where id = job_id) from job_settings where key = 'UEFI_PFLASH_VARS' and value like '%ovmf%' order by job_id desc limit 50;) so implementing a route that would do that should be easy.

#2 Updated by okurz about 2 months ago

  • Related to action #117619: Bot approved update request with failing tests size:M added

#3 Updated by okurz about 2 months ago

  • Parent task set to #117694

#4 Updated by cdywan about 2 months ago

For example https://openqa.suse.de/tests/9642631#settings

_TEST_ISSUES or _TEST_REPOS could be queried, either efficiently by number lookup or fulltext search respectively - in the former case we'd need to ensure the variables aren't dropped from jobs in the future

#5 Updated by tinita about 2 months ago

cdywan wrote:

For example https://openqa.suse.de/tests/9642631#settings

_TEST_ISSUES or _TEST_REPOS could be queried, either efficiently by number lookup or fulltext search respectively - in the former case we'd need to ensure the variables aren't dropped from jobs in the future

I think both variables are a fulltext search.

BASE_TEST_ISSUES=23246,24207,24462,...
BASE_TEST_REPOS=http://download.suse.de/ibs/SUSE:/Maintenance:/23246/SUSE_Updates_SLE-Module-Basesystem_15-SP4_x86_64/,...

I can't see where we would be able to search by a number.

#6 Updated by okurz about 2 months ago

  • Target version changed from Ready to future

#7 Updated by kraih about 2 months ago

This feature would be very nice for investigating problems with the dashboard. I would even go so far as to make it usable from the UI too (in addition to a JSON API endpoint), then the dashboard could link to all jobs for a certain incident.

#8 Updated by okurz about 2 months ago

  • Target version changed from future to Ready

ok. Sounds good but I would start with JSON only, UI for something later.

#9 Updated by kraih about 2 months ago

  • Description updated (diff)

#10 Updated by kraih about 2 months ago

  • Description updated (diff)

#11 Updated by kraih about 2 months ago

okurz wrote:

ok. Sounds good but I would start with JSON only, UI for something later.

Fair enough, i've added that as an AC. And the ui feature as a suggestion for a followup ticket.

#12 Updated by kraih about 2 months ago

  • Assignee set to kraih

#13 Updated by kraih about 1 month ago

  • Status changed from New to In Progress

#14 Updated by openqa_review about 1 month ago

  • Due date set to 2022-11-08

Setting due date based on mean cycle time of SUSE QE Tools

#15 Updated by kraih about 1 month ago

Performance is a little bit of an issue here, since we need to scan two fields for partial matches.

openqa=# explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%';
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..777812.25 rows=5321 width=97) (actual time=150.485..10369.291 rows=43729 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on job_settings  (cost=0.00..776280.15 rows=2217 width=97) (actual time=162.084..10232.458 rows=14576 loops=3)
         Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
         Rows Removed by Filter: 11064132
 Planning Time: 0.398 ms
 JIT:
   Functions: 6
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.027 ms, Inlining 227.961 ms, Optimization 113.843 ms, Emission 58.950 ms, Total 402.781 ms
 Execution Time: 10400.237 ms
(12 rows)

We might have to look into using the pg_trgm extension, which would allow the use of a GIN index to speed up LIKE/ILIKE matching. I'll have to do some experiments to see how much that improves performance. On OSD we are dealing with 33236253 rows, and on O3 with 9758036 rows. That's a quite substantial dataset.

#16 Updated by kraih about 1 month ago

Not sure why searching for all known keys in an array was so much worse.

explain analyze select * from job_settings where key = any('{"SERVERAPP_TEST_ISSUES", "SES_TEST_ISSUES", "SLES4SAP_TEST_ISSUES", "ASMM_TEST_ISSUES", "OS_TEST_ISSUES", "LIVE_TEST_ISSUES", "HPCM_TEST_ISSUES", "LTSS_TEST_ISSUES", "DESKTOP_TEST_ISSUES","PYTHON2_TEST_ISSUES","SCRIPT_TEST_ISSUES","CONTM_TEST_ISSUES","RT_TEST_ISSUES","HA_TEST_ISSUES","WSM_TEST_ISSUES","PCM_TEST_ISSUES","WE_TEST_ISSUES","BASE_TEST_ISSUES","TCM_TEST_ISSUES","PYTHON3_TEST_ISSUES","SAP_TEST_ISSUES","SDK_TEST_ISSUES","LEGACY_TEST_ISSUES"}') and value like '%26023%';
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on job_settings  (cost=8644.17..602899.54 rows=3322 width=97) (actual time=457.431..143094.311 rows=43731 loops=1)
   Recheck Cond: (key = ANY ('{SERVERAPP_TEST_ISSUES,SES_TEST_ISSUES,SLES4SAP_TEST_ISSUES,ASMM_TEST_ISSUES,OS_TEST_ISSUES,LIVE_TEST_ISSUES,HPCM_TEST_ISSUES,LTSS_TEST_ISSUES,DESKTOP_TEST_ISSUES,PYTHON2_TEST_ISSUES,SCRIPT_TEST_ISSUES,CONTM_TEST_ISSUES,RT_TEST_ISSUES,HA_TEST_ISSUES,WSM_TEST_ISSUES,PCM_TEST_ISSUES,WE_TEST_ISSUES,BASE_TEST_ISSUES,TCM_TEST_ISSUES,PYTHON3_TEST_ISSUES,SAP_TEST_ISSUES,SDK_TEST_ISSUES,LEGACY_TEST_ISSUES}'::text[]))
   Filter: (value ~~ '%26023%'::text)
   Rows Removed by Filter: 922847
   Heap Blocks: exact=314261
   ->  Bitmap Index Scan on idx_value_settings  (cost=0.00..8643.28 rows=542173 width=0) (actual time=259.079..259.081 rows=1005205 loops=1)
         Index Cond: (key = ANY ('{SERVERAPP_TEST_ISSUES,SES_TEST_ISSUES,SLES4SAP_TEST_ISSUES,ASMM_TEST_ISSUES,OS_TEST_ISSUES,LIVE_TEST_ISSUES,HPCM_TEST_ISSUES,LTSS_TEST_ISSUES,DESKTOP_TEST_ISSUES,PYTHON2_TEST_ISSUES,SCRIPT_TEST_ISSUES,CONTM_TEST_ISSUES,RT_TEST_ISSUES,HA_TEST_ISSUES,WSM_TEST_ISSUES,PCM_TEST_ISSUES,WE_TEST_ISSUES,BASE_TEST_ISSUES,TCM_TEST_ISSUES,PYTHON3_TEST_ISSUES,SAP_TEST_ISSUES,SDK_TEST_ISSUES,LEGACY_TEST_ISSUES}'::text[]))
 Planning Time: 0.409 ms
 JIT:
   Functions: 4
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.801 ms, Inlining 9.158 ms, Optimization 24.012 ms, Emission 11.513 ms, Total 45.483 ms
 Execution Time: 143139.210 ms
(13 rows)

#17 Updated by kraih about 1 month ago

I have ways to improve performance, but they might have to be discussed because they require PostgreSQL extensions. For now i'll finish the (slow) implementation with unit tests.

#18 Updated by okurz about 1 month ago

  • Related to action #119746: [spike][timeboxed:20h] Filter openQA todo-jobs on /tests belonging to one "review squad" added

#19 Updated by okurz about 1 month ago

Discussed in weekly unblock 2022-11-02. pg_trgm might be an option but first we should try to limit the search base, e.g. only select N most recent jobs, e.g. 10k jobs, easily limitable by the job id in job_settings, not needing a join. Also possible to limit by time, e.g. only go back 90 days. But then we need to give the user the option to tweak that, e.g. increase from N jobs or from M days.

#20 Updated by kraih about 1 month ago

This is the feature so far (https://github.com/os-autoinst/openQA/compare/k/incident_search), now i just have to improve performance a bit.

#21 Updated by mkittler about 1 month ago

  • Subject changed from Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does to Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M

#22 Updated by kraih about 1 month ago

Some first benchmarking results with limits for t_created and based on job_id.

Even with an index on t_created, limiting the results to 90 days does pretty much nothing:

explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and t_created > NOW() - interval '90' day;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..881045.84 rows=46 width=95) (actual time=247.489..16445.371 rows=45874 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on job_settings  (cost=0.00..880041.24 rows=19 width=95) (actual time=342.451..15959.041 rows=15291 loops=3)
         Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text) AND (t_created > (now() - '90 days'::interval day)))
         Rows Removed by Filter: 11006193
 Planning Time: 0.255 ms
 JIT:
   Functions: 6
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 4.581 ms, Inlining 528.594 ms, Optimization 305.590 ms, Emission 169.980 ms, Total 1008.745 ms
 Execution Time: 16477.522 ms
(12 rows)

Only when you limit it to a value as low as 5 days does it bother to use the index:

explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and t_created > NOW() - interval '5' day;
                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=12553.51..602959.31 rows=2 width=95) (actual time=262.828..857.520 rows=4694 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on job_settings  (cost=11553.51..601959.11 rows=1 width=95) (actual time=434.889..619.179 rows=1565 loops=3)
         Recheck Cond: (t_created > (now() - '5 days'::interval day))
         Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
         Rows Removed by Filter: 463509
         Heap Blocks: exact=26091
         ->  Bitmap Index Scan on job_settings_t_created_idx  (cost=0.00..11553.51 rows=1051342 width=0) (actual time=249.785..249.787 rows=1395222 loops=1)
               Index Cond: (t_created > (now() - '5 days'::interval day))
 Planning Time: 0.405 ms
 JIT:
   Functions: 15
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 5.654 ms, Inlining 524.190 ms, Optimization 333.531 ms, Emission 132.677 ms, Total 996.053 ms
 Execution Time: 861.819 ms
(16 rows)

The job_id limit performs quite well though:

openqa=# drop index job_settings_t_created_idx;
DROP INDEX
openqa=# select max(job_id) from job_settings;
   max
---------
 9866294
(1 row)

openqa=# explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and job_id > 9856294;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on job_settings  (cost=2715.53..432973.63 rows=1 width=95) (actual time=50.623..291.816 rows=99 loops=1)
   Recheck Cond: (job_id > 9856294)
   Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
   Rows Removed by Filter: 470135
   Heap Blocks: exact=9986
   ->  Bitmap Index Scan on job_settings_idx_job_id  (cost=0.00..2715.52 rows=223345 width=0) (actual time=24.384..24.386 rows=470547 loops=1)
         Index Cond: (job_id > 9856294)
 Planning Time: 0.585 ms
 JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 4.009 ms, Inlining 0.000 ms, Optimization 6.386 ms, Emission 8.188 ms, Total 18.583 ms
 Execution Time: 296.938 ms
(13 rows)

Even if we were to go back 50k job ids it seems ok-ish:

explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%' and job_id > 9816294;
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=29053.83..615952.68 rows=5 width=95) (actual time=1300.236..1998.500 rows=7709 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on job_settings  (cost=28053.83..614952.18 rows=2 width=95) (actual time=1289.221..1832.214 rows=2570 loops=3)
         Recheck Cond: (job_id > 9816294)
         Filter: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
         Rows Removed by Filter: 843520
         Heap Blocks: exact=30864
         ->  Bitmap Index Scan on job_settings_idx_job_id  (cost=0.00..28053.83 rows=2308985 width=0) (actual time=1185.370..1185.371 rows=2541283 loops=1)
               Index Cond: (job_id > 9816294)
 Planning Time: 0.266 ms
 JIT:
   Functions: 12
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 3.517 ms, Inlining 248.875 ms, Optimization 119.009 ms, Emission 80.621 ms, Total 452.021 ms
 Execution Time: 2003.399 ms
(16 rows)

#23 Updated by kraih about 1 month ago

And next up pg_trgm. This thing really rocks.

openqa-osd=> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
openqa-osd=> CREATE INDEX ON job_settings USING gin (key gin_trgm_ops);
CREATE INDEX
openqa-osd=> CREATE INDEX ON job_settings USING gin (value gin_trgm_ops);
CREATE INDEX

It seems to know that it only really needs the value index, and it makes a huge difference:

openqa-osd=> explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%';
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on job_settings  (cost=263.11..7959.94 rows=58 width=96) (actual time=47.152..372.824 rows=45874 loops=1)
   Recheck Cond: (value ~~ '%26023%'::text)
   Rows Removed by Index Recheck: 3
   Filter: (key ~~ '%TEST_ISSUES'::text)
   Rows Removed by Filter: 46854
   Heap Blocks: exact=51833
   ->  Bitmap Index Scan on job_settings_value_idx  (cost=0.00..263.10 rows=2013 width=0) (actual time=35.969..35.970 rows=92731 loops=1)
         Index Cond: (value ~~ '%26023%'::text)
 Planning Time: 1.216 ms
 Execution Time: 374.051 ms
(10 rows)

On my development VM creating the index takes about 4 minutes with a very fast SSD. And they do take up quite a bit of space:

public | job_settings_key_idx       | index | sri   | job_settings | permanent   | gin           | 679 MB  |
public | job_settings_value_idx     | index | sri   | job_settings | permanent   | gin           | 1534 MB |

#24 Updated by kraih about 1 month ago

And for completeness sake a compound trigram index on both fields:

openqa-osd=> CREATE INDEX ON job_settings USING gin (key gin_trgm_ops, value gin_trgm_ops);
CREATE INDEX

And it's pretty fast:

openqa-osd=> explain analyze select * from job_settings where key like '%TEST_ISSUES' and value like '%26023%';
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on job_settings  (cost=952.59..1183.63 rows=58 width=96) (actual time=126.601..279.679 rows=45874 loops=1)
   Recheck Cond: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
   Heap Blocks: exact=37127
   ->  Bitmap Index Scan on job_settings_key_value_idx  (cost=0.00..952.58 rows=58 width=0) (actual time=122.977..122.977 rows=45874 loops=1)
         Index Cond: ((key ~~ '%TEST_ISSUES'::text) AND (value ~~ '%26023%'::text))
 Planning Time: 1.018 ms
 Execution Time: 280.836 ms
(7 rows)

But also large (took about 6 minutes to create the index):

public | job_settings_key_value_idx | index | sri   | job_settings | permanent   | gin           | 2211 MB |

#25 Updated by kraih 30 days ago

PR opened with the job id limit solution: https://github.com/os-autoinst/openQA/pull/4888

#26 Updated by kraih 30 days ago

Looks like this ticket has to go back to the drawing board, since there have been too many votes against an incident specific API.

#27 Updated by kraih 27 days ago

  • Due date changed from 2022-11-08 to 2022-11-10

#28 Updated by kraih 27 days ago

Updated PR with a more generic solution: https://github.com/os-autoinst/openQA/pull/4888

#29 Updated by kraih 26 days ago

  • Status changed from In Progress to Feedback

#30 Updated by kraih 26 days ago

  • Related to action #120115: [qem-dashboard] Use new openQA job settings API to show more data added

#31 Updated by kraih 26 days ago

  • Related to action #120118: Provide job settings filter options added

#32 Updated by kraih 26 days ago

I've prepared two possible followups. One on the openQA side: #120118, and one on the qem-dashboard side: #120115. My preference would probably be the solution on the openQA side, since it's more versatile and provides more information to the test reviewers.

#33 Updated by okurz 23 days ago

  • Due date deleted (2022-11-10)
  • Status changed from Feedback to Resolved

Examples:

openqa-cli api --o3  /job_settings/jobs key=*_TEST_ISSUES list_value=1234567
openqa-cli api --osd  /job_settings/jobs key=LTSS_TEST_ISSUES list_value=20988

we tried that out and it worked to return a list of jobs. For now I would call this sufficient. We can think about uses of that or extensions in the future.

Also available in: Atom PDF