action #76903: openQA test modules failure statistics collection - openQA Infrastructure - openSUSE Project Management Tool

Actions

Copy link

action #76903

closed

openQA test modules failure statistics collection

Added by vpelcak about 4 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Category:

Target version:

openQA Project - Ready

Start date:

2020-11-03

Due date:

2020-12-03

% Done:

Estimated time:

Description

User Story¶

I would like us to be able to have some statistics about openQA test modules failures.
How often each test modules fails, which fail the most often, what are the trends of failure rate....

Acceptance criteria¶

AC1: Current job module details from OSD are available on a SUSE instance of Metabase

Out of scope¶

Persistent database setup and configuration on osd with salt -> #78222

Suggestions¶

Use the internal Metabase instance (login by OKTA credentials, but as login put only part of e-mail address before the @)
Look into exposing read-only database access to make queries available, e.g. similar to what we already do for "telegraf" in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L163 but including the "job_modules" table
- take https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 as inspiration
Allow access to postgres database from outside "localhost":
- Research security implications and "best practices", e.g. follow https://blog.bigbinary.com/2016/01/23/configure-postgresql-to-allow-remote-connection.html
- Change postgres "listen" address in postgresql.conf to listen to the ethernet device and not only localhost, e.g. postgr
- Change postgres config in pg_hba.conf to allow access for remote connections

Files

Download all files

metabase_openqa_suse_de_job_modules_overview_2020-11-19.png (68.1 KB) metabase_openqa_suse_de_job_modules_overview_2020-11-19.png		okurz, 2020-11-19 14:30
which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase.png (83.2 KB) which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase.png		okurz, 2021-07-16 13:59

Related issues 1 (1 open — 0 closed)

Actions

Copy link

Updated by livdywan about 4 years ago

Description updated (diff)

Actions

Copy link

Updated by okurz about 4 years ago

Subject changed from openQA testcase failure statistics collection to openQA test modules failure statistics collection
Description updated (diff)
Due date set to 2020-12-01
Status changed from New to Feedback
Assignee set to okurz
Target version set to Ready

Nice to have this ticket. I already thought about generic "read-only access to the openQA database" in general.

As soon as we have capacity in the team's backlog we can pick it up, hence setting to "Future" for now.

I clarified with vpelcak that the initial request was about "test modules", adapted issue description accordingly.

To answer once right now the question which test modules fail how often:

$ for i in o3 osd; do ssh $i "echo \"select script,count(id) from job_modules where result='failed' group by script order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done
                script                 | count 
---------------------------------------+-------
 tests/kernel/run_ltp.pm               | 22592
 tests/installation/start_install.pm   |  1818
 tests/shutdown/shutdown.pm            |   994
 tests/x11/firefox.pm                  |   767
 tests/x11/vlc.pm                      |   739
 tests/installation/zdup.pm            |   725
 tests/x11/chromium.pm                 |   724
 tests/console/consoletest_finish.pm   |   600
 tests/installation/bootloader_uefi.pm |   550
 tests/console/rails.pm                |   468
(10 rows)

                  script                  | count 
------------------------------------------+-------
 tests/kernel/run_ltp.pm                  | 14767
 tests/update/patch_sle.pm                |  3244
 tests/boot/boot_to_desktop.pm            |  2382
 tests/installation/addon_products_sle.pm |  1991
 tests/installation/upgrade_select.pm     |  1989
 tests/xfstests/generate_report.pm        |  1964
 tests/installation/welcome.pm            |  1936
 tests/installation/scc_registration.pm   |  1581
 tests/kernel/boot_ltp.pm                 |  1549
 tests/installation/first_boot.pm         |  1485
(10 rows)

to understand why "run_ltp" is so high up in the statistic: There can by "dynamic test modules" where one "perl script module" can be parameterized to run actual generated test modules. So it looks different if we look "by category and name":

$ for i in o3 osd; do ssh $i "echo \"select category,name,count(id) from job_modules where result='failed' group by category,name order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done
   category   |        name        | count 
--------------+--------------------+-------
 installation | start_install      |  1818
 shutdown     | shutdown           |   994
 x11          | firefox            |   767
 x11          | vlc                |   739
 installation | zdup               |   725
 x11          | chromium           |   724
 console      | consoletest_finish |   600
 installation | bootloader_uefi    |   519
 console      | rails              |   468
 console      | mysql_odbc         |   445
(10 rows)

   category   |        name        | count 
--------------+--------------------+-------
 update       | patch_sle          |  3244
 boot         | boot_to_desktop    |  2307
 installation | addon_products_sle |  1991
 installation | upgrade_select     |  1989
 installation | welcome            |  1936
 installation | scc_registration   |  1581
 kernel       | boot_ltp           |  1540
 installation | first_boot         |  1485
 updinstall   | update_install     |  1424
 console      | zypper_lifecycle   |  1327
(10 rows)

Now regarding metabase I think it's a nice idea especially considering that a metabase instance already exists (managed to login myself with username "okurz"). I think we need to follow
https://www.metabase.com/docs/v0.35.4/setting-up-metabase.html#gathering-your-database-info
and for that in my understanding we need an "Admin" on the instance.

@vpelcak can you either make me/us admin or find an existing admin to add the corresponding openQA database access?

Actions

Copy link

Updated by VANASTASIADIS about 4 years ago

@okurz, Gustavo is an admin. I talked to him and he can add the db and any security restrictions as well, but he needs the credentials to access the db.
He suggested, and I agree, to create a read-only user on the db for metabase, if there isn't one already.

Actions

Copy link

Updated by okurz about 4 years ago

ok, thanks for your help and great to see your comment.

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 which should ensure there is a postgres user "openqa" but I will need to check first if this works with salt and how to enable access from the outside

Actions

Copy link

Updated by okurz about 4 years ago

Description updated (diff)
Due date deleted (~~2020-12-01~~)
Status changed from Feedback to Workable
Assignee deleted (~~okurz~~)
Priority changed from Normal to Low

Updated with suggestions/tasks

Actions

Copy link

Updated by okurz about 4 years ago

Target version changed from Ready to future

Actions

Copy link

Updated by okurz about 4 years ago

Would you like to pick up this ticket? I can provide help and support but just not following on right now myself.

Actions

Copy link

Updated by okurz about 4 years ago

Target version changed from future to Ready

Actions

Copy link

Updated by okurz about 4 years ago

discussed with @bzoltan1 and @VANASTASIADIS . Idea: Change query to limit to "last 365 days" and "maintenance job groups".

hypothesis: are more test modules failing in the last half a year than in before? I.e. ratio "failing / passed"

We need a rough answer for some data quick, i.e. "days". @VANASTASIADIS should hack up some python script to get median/mean/std for "qam tickets" in past year or so. And then in scope of "weeks" we work on connecting openQA db to metabase for future cool stuff :)

Actions

Copy link

#10

Updated by okurz about 4 years ago

Copied to action #78222: Persistent postgres setup and configuration in salt added

Actions

Copy link

#11

Updated by okurz about 4 years ago

Description updated (diff)
Status changed from Workable to In Progress
Assignee set to okurz

I have moved out the salt specific parts into a separate ticket as we do not have the postgres database setup covered with salt in general so far except for the "telegraf" roles we add, which is a good start though.

I will look into my MR again.

Actions

Copy link

#12

Updated by okurz about 4 years ago

Priority changed from Low to High

the ticket is actually more important to QAM so bumping to High.

Actions

Copy link

#13

Updated by VANASTASIADIS about 4 years ago

@okurz @bzoltan1
First results from the redmine api scripts:

For all tickets with [qam] or [qem] in the subject:
All time MEDIAN time open->close: 29.14 DAYS
Last year MEDIAN time open->close: 47.95 DAYS

All time MEAN time open->close: 86.88 DAYS
Last year MEAN time open->close: 124.67 DAYS

Done by calculating closing - creation date of a ticket.
I use the closing date of a ticket as the ticket's overall date, to decide whether it's in the last year or not.

I removed tickets with 0 time elapsed between opening and closing of a ticket, but I did not remove anything else. If you have any ideas on how to filter out more tickets, I'll implement that too.

Actions

Copy link

#14

Updated by VANASTASIADIS about 4 years ago

Some more detailed results:

All time:
Tickets that took 30 days were 591
Tickets that take >30 days are 49.25% of total functional closed tickets
Median of tickets that took <30 days: 5.78 days
Mean of tickets that took <30 days: 8.43 days

Last year:
Tickets that took 30 days were 464
Tickets that take >30 days are 55.17% of total functional closed tickets
Median of tickets that took <30 days: 6.98 days
Mean of tickets that took <30 days: 9.43 days

Actions

Copy link

#15

Updated by okurz about 4 years ago

File metabase_openqa_suse_de_job_modules_overview_2020-11-19.png metabase_openqa_suse_de_job_modules_overview_2020-11-19.png added

Was tricky to get escaping right to modify postgresql.conf but tested successfully on lord.arch, my test instance, with:

salt-call --local file.replace /var/lib/pgsql/data/postgresql.conf pattern="(listen_addresses = ')[^']*('.*$)" repl='\1*\2'

did not find a parameter test=True for salt-call so I also did a bit of salt -C 'G@roles:webui' -l error --state-output=changes state.apply openqa.server test=True on osd.

Updated https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 to also change the postgres server config to allow access outside localhost. This is already applied to osd.

Asked @gboiko in RC room #testing to create the database connection or give me access so that I can do that. But as there was no immediate answer I am experimenting with my own metabase instance, on c559.arch.suse.de (VM on lord.arch, Tumbleweed):

firewall-cmd --add-port 3000/tcp
zypper -n in podman
podman run -d -p 3000:3000 --name metabase metabase/metabase

then in browser one can connect to http://10.161.61.47:3000/ (DNS does not work, likely because IPv6 on lord.arch is incorrectly configured by me).

I configured postgres database access:

host: openqa.suse.de
DB name: openqa
user: openqa
password: openqa

I see that it would have actually been possible to use an SSH connection (or SSH tunnel even) but we do not need that anymore. I selected some conservative settings, e.g. no automatic syncs, schema sync only on demand, etc.

This is fun. After configuring the database metabase already shows some fancy stuff from the beginning, see attachment.

EDIT: 2020-11-19: Tried to enable LDAP authentication as well following https://confluence.suse.com/display/enginfra/pan.suse.de

EDIT: 2020-11-19: I wonder what other tables we can provide access to without making access too easy for data that one could misuse to derive personal performance information which we must avoid to not have problems with our workers council :)

Actions

Copy link

#16

Updated by okurz about 4 years ago

Due date set to 2020-12-03
Status changed from In Progress to Feedback

DB access is possible and I setup a proof-of-concept metabase instance. I provided credentials to @gboiko and he will inform me/us if the server he maintains is available again and ready for use for the mentioned use case.

Actions

Copy link

#17

Updated by okurz about 4 years ago

The data is available on
https://maintstats.dyn.cloud.suse.de/browse/5

please be aware that only "jobs" and "job_modules" are currently configured to be readable. Also be aware that with our current storage available for openqa.suse.de we are quite limited how long we can store results. Some "important" jobs are kept around for very long, e.g. milestone GA builds as well as jobs linked to bug reports if marked in openQA as such but general test results are reliable for a period of roughly 10 days only.

I wonder why it shows all tables as such "jobs" and "job_modules" are readable. Maybe we can hide some tables in there.
One difference is that I was using v0.37.2 of metabase, maintstats.dyn.cloud.suse.de is 0.36

So with this I see it as hopefully helpful what we could accomplish to answer the questions of "How often each test modules fails, which fail the most often" already in the beginning with giving everyone the possibility to use "metabase" in a convenient, pretty interface for further analysis. The question for "what are the trends of failure rate...." is likely better answered by what @VANASTASIADIS provided in #76903#note-14 or maybe by looking at the github repository of os-autoinst-distri-opensuse itself as we can not answer that directly easily from openQA test results from the live, production database.

@vpelcak @bzoltan1 I am happy to receive further feedback from you, otherwise I would regard this ticket as "Resolved".

Actions

Copy link

#18

Updated by okurz almost 4 years ago

Status changed from Feedback to Resolved

I take "no response" as "ok with what we have achieved" :)

Actions

Copy link

#19

Updated by okurz over 3 years ago

File which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase.png which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase.png added

A more recent example which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA » openQA Project » openQA Infrastructure

Tags

Custom queries

action #76903

openQA test modules failure statistics collection

User Story¶

Acceptance criteria¶

Out of scope¶

Suggestions¶

Updated by livdywan about 4 years ago

Updated by okurz about 4 years ago

Updated by VANASTASIADIS about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by VANASTASIADIS about 4 years ago

Updated by VANASTASIADIS about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz about 4 years ago

Updated by okurz almost 4 years ago

Updated by okurz over 3 years ago