action #76903
openQA test modules failure statistics collection
Status: closed, 0% done
Description
User Story
I would like us to be able to have some statistics about openQA test module failures:
How often each test module fails, which ones fail most often, what the trends of the failure rate are, etc.
Acceptance criteria
- AC1: Current job module details from OSD are available on a SUSE instance of Metabase
Out of scope
Persistent database setup and configuration on osd with salt -> #78222
Suggestions
- Use the internal Metabase instance (login by OKTA credentials, but as login put only part of e-mail address before the @)
- Look into exposing read-only database access to make queries available, e.g. similar to what we already do for "telegraf" in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L163 but including the "job_modules" table
- take https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 as inspiration
- Allow access to the postgres database from outside "localhost":
  - Research security implications and "best practices", e.g. follow https://blog.bigbinary.com/2016/01/23/configure-postgresql-to-allow-remote-connection.html
  - Change the postgres "listen" address in postgresql.conf to listen on the ethernet device and not only localhost
  - Change the postgres config in pg_hba.conf to allow remote connections
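For illustration, the two config changes could look roughly like this (a sketch only; the address range and auth method below are assumptions and must match the actual network setup):

```
# postgresql.conf: listen on all interfaces instead of only localhost
listen_addresses = '*'

# pg_hba.conf: allow one user password-based access from the internal
# network (the 10.0.0.0/8 range and md5 method are illustrative assumptions)
host    openqa    openqa    10.0.0.0/8    md5
```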
Updated by okurz about 4 years ago
- Subject changed from openQA testcase failure statistics collection to openQA test modules failure statistics collection
- Description updated (diff)
- Due date set to 2020-12-01
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Ready
Nice to have this ticket. I already thought about generic "read-only access to the openQA database" in general.
As soon as we have capacity in the team's backlog we can pick it up, hence setting to "Future" for now.
I clarified with vpelcak that the initial request was about "test modules", adapted issue description accordingly.
To give a one-off answer right now to the question which test modules fail how often:
$ for i in o3 osd; do ssh $i "echo \"select script,count(id) from job_modules where result='failed' group by script order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done
o3:
 script | count
---------------------------------------+-------
tests/kernel/run_ltp.pm | 22592
tests/installation/start_install.pm | 1818
tests/shutdown/shutdown.pm | 994
tests/x11/firefox.pm | 767
tests/x11/vlc.pm | 739
tests/installation/zdup.pm | 725
tests/x11/chromium.pm | 724
tests/console/consoletest_finish.pm | 600
tests/installation/bootloader_uefi.pm | 550
tests/console/rails.pm | 468
(10 rows)
osd:
 script | count
------------------------------------------+-------
tests/kernel/run_ltp.pm | 14767
tests/update/patch_sle.pm | 3244
tests/boot/boot_to_desktop.pm | 2382
tests/installation/addon_products_sle.pm | 1991
tests/installation/upgrade_select.pm | 1989
tests/xfstests/generate_report.pm | 1964
tests/installation/welcome.pm | 1936
tests/installation/scc_registration.pm | 1581
tests/kernel/boot_ltp.pm | 1549
tests/installation/first_boot.pm | 1485
(10 rows)
To understand why "run_ltp" is so high up in the statistics: there can be "dynamic test modules", where one "perl script module" can be parameterized to run multiple generated test modules. So it looks different if we group "by category and name":
$ for i in o3 osd; do ssh $i "echo \"select category,name,count(id) from job_modules where result='failed' group by category,name order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done
o3:
  category | name | count
--------------+--------------------+-------
installation | start_install | 1818
shutdown | shutdown | 994
x11 | firefox | 767
x11 | vlc | 739
installation | zdup | 725
x11 | chromium | 724
console | consoletest_finish | 600
installation | bootloader_uefi | 519
console | rails | 468
console | mysql_odbc | 445
(10 rows)
osd:
  category | name | count
--------------+--------------------+-------
update | patch_sle | 3244
boot | boot_to_desktop | 2307
installation | addon_products_sle | 1991
installation | upgrade_select | 1989
installation | welcome | 1936
installation | scc_registration | 1581
kernel | boot_ltp | 1540
installation | first_boot | 1485
updinstall | update_install | 1424
console | zypper_lifecycle | 1327
(10 rows)
Now regarding Metabase, I think it's a nice idea, especially considering that a Metabase instance already exists (I managed to log in myself with username "okurz"). I think we need to follow
https://www.metabase.com/docs/v0.35.4/setting-up-metabase.html#gathering-your-database-info
and for that, in my understanding, we need an "Admin" on the instance.
@vpelcak can you either make me/us admin or find an existing admin to add the corresponding openQA database access?
Updated by VANASTASIADIS about 4 years ago
@okurz, Gustavo is an admin. I talked to him and he can add the db and any security restrictions as well, but he needs the credentials to access the db.
He suggested, and I agree, to create a read-only user on the db for metabase, if there isn't one already.
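Such a read-only user could be created roughly like the following sketch, in the style of the psql queries above (the role name, password and granted tables are assumptions for illustration, not what was actually deployed):

```sql
-- Hypothetical sketch: a read-only role for Metabase
CREATE ROLE metabase LOGIN PASSWORD 'changeme';
GRANT CONNECT ON DATABASE openqa TO metabase;
GRANT USAGE ON SCHEMA public TO metabase;
-- expose only the tables needed for the statistics
GRANT SELECT ON jobs, job_modules TO metabase;
```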
Updated by okurz about 4 years ago
ok, thanks for your help and great to see your comment.
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 which should ensure there is a postgres user "openqa", but I will first need to check whether this works with salt and how to enable access from the outside.
Updated by okurz about 4 years ago
- Description updated (diff)
- Due date deleted (2020-12-01)
- Status changed from Feedback to Workable
- Assignee deleted (okurz)
- Priority changed from Normal to Low
Updated with suggestions/tasks
Updated by okurz about 4 years ago
Would you like to pick up this ticket? I can provide help and support but just not following on right now myself.
Updated by okurz about 4 years ago
discussed with @bzoltan1 and @VANASTASIADIS . Idea: Change query to limit to "last 365 days" and "maintenance job groups".
Hypothesis: are more test modules failing in the last half year than before? I.e. the ratio "failing / passed".
We need a rough answer with some data quickly, i.e. within "days". @VANASTASIADIS should hack up some python script to get median/mean/std for "qam tickets" in the past year or so. Then, in the scope of "weeks", we work on connecting the openQA db to Metabase for future cool stuff :)
Updated by okurz about 4 years ago
- Copied to action #78222: Persistent postgres setup and configuration in salt added
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from Workable to In Progress
- Assignee set to okurz
I have moved the salt-specific parts out into a separate ticket, as we do not have the postgres database setup covered with salt in general so far, except for the "telegraf" roles we add, which is a good start though.
I will look into my MR again.
Updated by okurz about 4 years ago
- Priority changed from Low to High
The ticket is actually more important to QAM, so bumping to High.
Updated by VANASTASIADIS about 4 years ago
@okurz @bzoltan1
First results from the redmine api scripts:
For all tickets with [qam] or [qem] in the subject:
All time MEDIAN time open->close: 29.14 DAYS
Last year MEDIAN time open->close: 47.95 DAYS
All time MEAN time open->close: 86.88 DAYS
Last year MEAN time open->close: 124.67 DAYS
Done by calculating the closing minus creation date of each ticket.
I use the closing date of a ticket as the ticket's overall date to decide whether it falls within the last year or not.
I removed tickets with 0 time elapsed between opening and closing, but I did not remove anything else. If you have any ideas on how to filter out more tickets, I'll implement that too.
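A minimal sketch of what the core computation of such a script could look like (the ticket data and helper names below are made up for illustration; the real script pulls created/closed dates from the Redmine API):

```python
from datetime import datetime
from statistics import mean, median, stdev

# Hypothetical (created_on, closed_on) pairs as a Redmine API script
# might collect them; the dates here are illustrative only
tickets = [
    ("2020-01-02", "2020-01-10"),
    ("2020-02-01", "2020-02-01"),  # zero-duration ticket, filtered out below
    ("2020-03-05", "2020-04-20"),
    ("2020-05-01", "2020-07-15"),
]

def duration_days(created, closed):
    """Closing minus creation date of a ticket, in days."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(closed, fmt) - datetime.strptime(created, fmt)).days

# Drop tickets with 0 time elapsed between opening and closing
durations = [d for d in (duration_days(c, x) for c, x in tickets) if d > 0]

print(f"median: {median(durations):.2f} days")
print(f"mean:   {mean(durations):.2f} days")
print(f"stdev:  {stdev(durations):.2f} days")
```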
Updated by VANASTASIADIS about 4 years ago
Some more detailed results:
All time:
Tickets that took >30 days were 591
Tickets that take >30 days are 49.25% of total functional closed tickets
Median of tickets that took <30 days: 5.78 days
Mean of tickets that took <30 days: 8.43 days
Last year:
Tickets that took >30 days were 464
Tickets that take >30 days are 55.17% of total functional closed tickets
Median of tickets that took <30 days: 6.98 days
Mean of tickets that took <30 days: 9.43 days
Updated by okurz about 4 years ago
- File metabase_openqa_suse_de_job_modules_overview_2020-11-19.png metabase_openqa_suse_de_job_modules_overview_2020-11-19.png added
It was tricky to get the escaping right to modify postgresql.conf, but I tested it successfully on lord.arch, my test instance, with:
salt-call --local file.replace /var/lib/pgsql/data/postgresql.conf pattern="(listen_addresses = ')[^']*('.*$)" repl='\1*\2'
I did not find a test=True parameter for salt-call, so I also did a bit of
salt -C 'G@roles:webui' -l error --state-output=changes state.apply openqa.server test=True
on osd.
Updated https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 to also change the postgres server config to allow access outside localhost. This is already applied to osd.
Asked @gboiko in RC room #testing to create the database connection or give me access so that I can do that. But as there was no immediate answer I am experimenting with my own metabase instance, on c559.arch.suse.de (VM on lord.arch, Tumbleweed):
firewall-cmd --add-port 3000/tcp
zypper -n in podman
podman run -d -p 3000:3000 --name metabase metabase/metabase
then in a browser one can connect to http://10.161.61.47:3000/ (DNS does not work, likely because IPv6 on lord.arch is configured incorrectly by me).
I configured postgres database access:
- host: openqa.suse.de
- DB name: openqa
- user: openqa
- password: openqa
I see that it would have actually been possible to use an SSH connection (or SSH tunnel even) but we do not need that anymore. I selected some conservative settings, e.g. no automatic syncs, schema sync only on demand, etc.
This is fun. After configuring the database metabase already shows some fancy stuff from the beginning, see attachment.
EDIT: 2020-11-19: Tried to enable LDAP authentication as well following https://confluence.suse.com/display/enginfra/pan.suse.de
EDIT: 2020-11-19: I wonder what other tables we can provide access to without making it too easy to access data that one could misuse to derive personal performance information, which we must avoid so we don't get into problems with our works council :)
Updated by okurz about 4 years ago
- Due date set to 2020-12-03
- Status changed from In Progress to Feedback
DB access is possible and I set up a proof-of-concept Metabase instance. I provided credentials to @gboiko and he will inform me/us when the server he maintains is available again and ready for the mentioned use case.
Updated by okurz about 4 years ago
The data is available on
https://maintstats.dyn.cloud.suse.de/browse/5
Please be aware that only the "jobs" and "job_modules" tables are currently configured to be readable. Also be aware that with the storage currently available for openqa.suse.de we are quite limited in how long we can keep results. Some "important" jobs are kept for a very long time, e.g. milestone GA builds as well as jobs linked to bug reports if marked as such in openQA, but general test results are reliably available only for a period of roughly 10 days.
I wonder why it shows all tables even though only "jobs" and "job_modules" are readable. Maybe we can hide some tables in there.
One difference is that I was using Metabase v0.37.2 while maintstats.dyn.cloud.suse.de runs 0.36.
So with this I hope we already accomplished answering the questions "how often each test module fails" and "which fail the most often" by giving everyone the possibility to use Metabase as a convenient, pretty interface for further analysis. The question about "the trends of the failure rate" is likely better answered by what @VANASTASIADIS provided in #76903#note-14, or maybe by looking at the github repository of os-autoinst-distri-opensuse itself, as we cannot easily answer that directly from the live, production openQA database.
@vpelcak @bzoltan1 I am happy to receive further feedback from you, otherwise I would regard this ticket as "Resolved".
Updated by okurz almost 4 years ago
- Status changed from Feedback to Resolved
I take "no response" as "ok with what we have achieved" :)
Updated by okurz over 3 years ago
- File which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase.png which_openQA_testsuites_and_test_modules_fail_the_most_on_OSD_metabase.png added
A more recent example