Project

General

Profile

action #76903

openQA test modules failure statistics collection

Added by vpelcak 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2020-11-03
Due date:
2020-12-03
% Done:

0%

Estimated time:

Description

User Story

I would like us to be able to have some statistics about openQA test modules failures.
How often each test modules fails, which fail the most often, what are the trends of failure rate....

Acceptance criteria

  • AC1: Current job module details from OSD are available on a SUSE instance of Metabase

Out of scope

Persistent database setup and configuration on osd with salt -> #78222

Suggestions


Related issues

Copied to openQA Infrastructure - action #78222: Persistent postgres setup and configuration in saltWorkable

History

#1 Updated by cdywan 3 months ago

  • Description updated (diff)

#2 Updated by okurz 3 months ago

  • Subject changed from openQA testcase failure statistics collection to openQA test modules failure statistics collection
  • Description updated (diff)
  • Due date set to 2020-12-01
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Ready

Nice to have this ticket. I already thought about generic "read-only access to the openQA database" in general.

As soon as we have capacity in the team's backlog we can pick it up, hence setting to "Future" for now.

I clarified with vpelcak that the initial request was about "test modules", adapted issue description accordingly.

To answer once right now the question which test modules fail how often:

$ for i in o3 osd; do ssh $i "echo \"select script,count(id) from job_modules where result='failed' group by script order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done
                script                 | count 
---------------------------------------+-------
 tests/kernel/run_ltp.pm               | 22592
 tests/installation/start_install.pm   |  1818
 tests/shutdown/shutdown.pm            |   994
 tests/x11/firefox.pm                  |   767
 tests/x11/vlc.pm                      |   739
 tests/installation/zdup.pm            |   725
 tests/x11/chromium.pm                 |   724
 tests/console/consoletest_finish.pm   |   600
 tests/installation/bootloader_uefi.pm |   550
 tests/console/rails.pm                |   468
(10 rows)

                  script                  | count 
------------------------------------------+-------
 tests/kernel/run_ltp.pm                  | 14767
 tests/update/patch_sle.pm                |  3244
 tests/boot/boot_to_desktop.pm            |  2382
 tests/installation/addon_products_sle.pm |  1991
 tests/installation/upgrade_select.pm     |  1989
 tests/xfstests/generate_report.pm        |  1964
 tests/installation/welcome.pm            |  1936
 tests/installation/scc_registration.pm   |  1581
 tests/kernel/boot_ltp.pm                 |  1549
 tests/installation/first_boot.pm         |  1485
(10 rows)

to understand why "run_ltp" is so high up in the statistic: There can by "dynamic test modules" where one "perl script module" can be parameterized to run actual generated test modules. So it looks different if we look "by category and name":

$ for i in o3 osd; do ssh $i "echo \"select category,name,count(id) from job_modules where result='failed' group by category,name order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done
   category   |        name        | count 
--------------+--------------------+-------
 installation | start_install      |  1818
 shutdown     | shutdown           |   994
 x11          | firefox            |   767
 x11          | vlc                |   739
 installation | zdup               |   725
 x11          | chromium           |   724
 console      | consoletest_finish |   600
 installation | bootloader_uefi    |   519
 console      | rails              |   468
 console      | mysql_odbc         |   445
(10 rows)

   category   |        name        | count 
--------------+--------------------+-------
 update       | patch_sle          |  3244
 boot         | boot_to_desktop    |  2307
 installation | addon_products_sle |  1991
 installation | upgrade_select     |  1989
 installation | welcome            |  1936
 installation | scc_registration   |  1581
 kernel       | boot_ltp           |  1540
 installation | first_boot         |  1485
 updinstall   | update_install     |  1424
 console      | zypper_lifecycle   |  1327
(10 rows)

Now regarding metabase I think it's a nice idea especially considering that a metabase instance already exists (managed to login myself with username "okurz"). I think we need to follow
https://www.metabase.com/docs/v0.35.4/setting-up-metabase.html#gathering-your-database-info
and for that in my understanding we need an "Admin" on the instance.

vpelcak can you either make me/us admin or find an existing admin to add the corresponding openQA database access?

#3 Updated by VANASTASIADIS 3 months ago

okurz, Gustavo is an admin. I talked to him and he can add the db and any security restrictions as well, but he needs the credentials to access the db.
He suggested, and I agree, to create a read-only user on the db for metabase, if there isn't one already.

#4 Updated by okurz 3 months ago

ok, thanks for your help and great to see your comment.

I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 which should ensure there is a postgres user "openqa" but I will need to check first if this works with salt and how to enable access from the outside

#5 Updated by okurz 3 months ago

  • Description updated (diff)
  • Due date deleted (2020-12-01)
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Priority changed from Normal to Low

Updated with suggestions/tasks

#6 Updated by okurz 3 months ago

  • Target version changed from Ready to future

#7 Updated by okurz 2 months ago

Would you like to pick up this ticket? I can provide help and support but just not following on right now myself.

#8 Updated by okurz 2 months ago

  • Target version changed from future to Ready

#9 Updated by okurz 2 months ago

discussed with bzoltan1 and @VANASTASIADIS . Idea: Change query to limit to "last 365 days" and "maintenance job groups".

hypothesis: are more test modules failing in the last half a year than in before? I.e. ratio "failing / passed"

We need a rough answer for some data quick, i.e. "days". @VANASTASIADIS should hack up some python script to get median/mean/std for "qam tickets" in past year or so. And then in scope of "weeks" we work on connecting openQA db to metabase for future cool stuff :)

#10 Updated by okurz 2 months ago

  • Copied to action #78222: Persistent postgres setup and configuration in salt added

#11 Updated by okurz 2 months ago

  • Description updated (diff)
  • Status changed from Workable to In Progress
  • Assignee set to okurz

I have moved out the salt specific parts into a separate ticket as we do not have the postgres database setup covered with salt in general so far except for the "telegraf" roles we add, which is a good start though.

I will look into my MR again.

#12 Updated by okurz 2 months ago

  • Priority changed from Low to High

the ticket is actually more important to QAM so bumping to High.

#13 Updated by VANASTASIADIS 2 months ago

okurz bzoltan1
First results from the redmine api scripts:

For all tickets with [qam] or [qem] in the subject:
All time MEDIAN time open->close: 29.14 DAYS
Last year MEDIAN time open->close: 47.95 DAYS

All time MEAN time open->close: 86.88 DAYS
Last year MEAN time open->close: 124.67 DAYS

Done by calculating closing - creation date of a ticket.
I use the closing date of a ticket as the ticket's overall date, to decide whether it's in the last year or not.

I removed tickets with 0 time elapsed between opening and closing of a ticket, but I did not remove anything else. If you have any ideas on how to filter out more tickets, I'll implement that too.

#14 Updated by VANASTASIADIS 2 months ago

Some more detailed results:

All time:
Tickets that took 30 days were 591
Tickets that take >30 days are 49.25% of total functional closed tickets
Median of tickets that took <30 days: 5.78 days
Mean of tickets that took <30 days: 8.43 days

Last year:
Tickets that took 30 days were 464
Tickets that take >30 days are 55.17% of total functional closed tickets
Median of tickets that took <30 days: 6.98 days
Mean of tickets that took <30 days: 9.43 days

#15 Updated by okurz 2 months ago

10816

Was tricky to get escaping right to modify postgresql.conf but tested successfully on lord.arch, my test instance, with:

salt-call --local file.replace /var/lib/pgsql/data/postgresql.conf pattern="(listen_addresses = ')[^']*('.*$)" repl='\1*\2'

did not find a parameter test=True for salt-call so I also did a bit of salt -C 'G@roles:webui' -l error --state-output=changes state.apply openqa.server test=True on osd.

Updated https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 to also change the postgres server config to allow access outside localhost. This is already applied to osd.

Asked gboiko in RC room #testing to create the database connection or give me access so that I can do that. But as there was no immediate answer I am experimenting with my own metabase instance, on c559.arch.suse.de (VM on lord.arch, Tumbleweed):

firewall-cmd --add-port 3000/tcp
zypper -n in podman
podman run -d -p 3000:3000 --name metabase metabase/metabase

then in browser one can connect to http://10.161.61.47:3000/ (DNS does not work, likely because IPv6 on lord.arch is incorrectly configured by me).

I configured postgres database access:

  • host: openqa.suse.de
  • DB name: openqa
  • user: openqa
  • password: openqa

I see that it would have actually been possible to use an SSH connection (or SSH tunnel even) but we do not need that anymore. I selected some conservative settings, e.g. no automatic syncs, schema sync only on demand, etc.

This is fun. After configuring the database metabase already shows some fancy stuff from the beginning, see attachment.

EDIT: 2020-11-19: Tried to enable LDAP authentication as well following https://confluence.suse.com/display/enginfra/pan.suse.de

EDIT: 2020-11-19: I wonder what other tables we can provide access to without making access too easy for data that one could misuse to derive personal performance information which we must avoid to not have problems with our workers council :)

#16 Updated by okurz 2 months ago

  • Due date set to 2020-12-03
  • Status changed from In Progress to Feedback

DB access is possible and I setup a proof-of-concept metabase instance. I provided credentials to gboiko and he will inform me/us if the server he maintains is available again and ready for use for the mentioned use case.

#17 Updated by okurz 2 months ago

The data is available on
https://maintstats.dyn.cloud.suse.de/browse/5

please be aware that only "jobs" and "job_modules" are currently configured to be readable. Also be aware that with our current storage available for openqa.suse.de we are quite limited how long we can store results. Some "important" jobs are kept around for very long, e.g. milestone GA builds as well as jobs linked to bug reports if marked in openQA as such but general test results are reliable for a period of roughly 10 days only.

I wonder why it shows all tables as such "jobs" and "job_modules" are readable. Maybe we can hide some tables in there.
One difference is that I was using v0.37.2 of metabase, maintstats.dyn.cloud.suse.de is 0.36

So with this I see it as hopefully helpful what we could accomplish to answer the questions of "How often each test modules fails, which fail the most often" already in the beginning with giving everyone the possibility to use "metabase" in a convenient, pretty interface for further analysis. The question for "what are the trends of failure rate...." is likely better answered by what @VANASTASIADIS provided in #76903#note-14 or maybe by looking at the github repository of os-autoinst-distri-opensuse itself as we can not answer that directly easily from openQA test results from the live, production database.

vpelcak bzoltan1 I am happy to receive further feedback from you, otherwise I would regard this ticket as "Resolved".

#18 Updated by okurz about 2 months ago

  • Status changed from Feedback to Resolved

I take "no response" as "ok with what we have achieved" :)

Also available in: Atom PDF