openQA test modules failure statistics collection
I would like us to be able to have some statistics about openQA test modules failures.
How often each test modules fails, which fail the most often, what are the trends of failure rate....
- AC1: Current job module details from OSD are available on a SUSE instance of Metabase
Out of scope¶
Persistent database setup and configuration on osd with salt -> #78222
- Use the internal Metabase instance (login by OKTA credentials, but as login put only part of e-mail address before the @)
- Look into exposing read-only database access to make queries available, e.g. similar to what we already do for "telegraf" in https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L163 but including the "job_modules" table
- take https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 as inspiration
- Allow access to postgres database from outside "localhost":
- Research security implications and "best practices", e.g. follow https://blog.bigbinary.com/2016/01/23/configure-postgresql-to-allow-remote-connection.html
- Change postgres "listen" address in postgresql.conf to listen to the ethernet device and not only localhost, e.g. postgr
- Change postgres config in pg_hba.conf to allow access for remote connections
- Subject changed from openQA testcase failure statistics collection to openQA test modules failure statistics collection
- Description updated (diff)
- Due date set to 2020-12-01
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Ready
Nice to have this ticket. I already thought about generic "read-only access to the openQA database" in general.
As soon as we have capacity in the team's backlog we can pick it up, hence setting to "Future" for now.
I clarified with vpelcak that the initial request was about "test modules", adapted issue description accordingly.
To answer once right now the question which test modules fail how often:
$ for i in o3 osd; do ssh $i "echo \"select script,count(id) from job_modules where result='failed' group by script order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done script | count ---------------------------------------+------- tests/kernel/run_ltp.pm | 22592 tests/installation/start_install.pm | 1818 tests/shutdown/shutdown.pm | 994 tests/x11/firefox.pm | 767 tests/x11/vlc.pm | 739 tests/installation/zdup.pm | 725 tests/x11/chromium.pm | 724 tests/console/consoletest_finish.pm | 600 tests/installation/bootloader_uefi.pm | 550 tests/console/rails.pm | 468 (10 rows) script | count ------------------------------------------+------- tests/kernel/run_ltp.pm | 14767 tests/update/patch_sle.pm | 3244 tests/boot/boot_to_desktop.pm | 2382 tests/installation/addon_products_sle.pm | 1991 tests/installation/upgrade_select.pm | 1989 tests/xfstests/generate_report.pm | 1964 tests/installation/welcome.pm | 1936 tests/installation/scc_registration.pm | 1581 tests/kernel/boot_ltp.pm | 1549 tests/installation/first_boot.pm | 1485 (10 rows)
to understand why "run_ltp" is so high up in the statistic: There can by "dynamic test modules" where one "perl script module" can be parameterized to run actual generated test modules. So it looks different if we look "by category and name":
$ for i in o3 osd; do ssh $i "echo \"select category,name,count(id) from job_modules where result='failed' group by category,name order by count DESC limit 10;\" | sudo -u geekotest psql openqa"; done category | name | count --------------+--------------------+------- installation | start_install | 1818 shutdown | shutdown | 994 x11 | firefox | 767 x11 | vlc | 739 installation | zdup | 725 x11 | chromium | 724 console | consoletest_finish | 600 installation | bootloader_uefi | 519 console | rails | 468 console | mysql_odbc | 445 (10 rows) category | name | count --------------+--------------------+------- update | patch_sle | 3244 boot | boot_to_desktop | 2307 installation | addon_products_sle | 1991 installation | upgrade_select | 1989 installation | welcome | 1936 installation | scc_registration | 1581 kernel | boot_ltp | 1540 installation | first_boot | 1485 updinstall | update_install | 1424 console | zypper_lifecycle | 1327 (10 rows)
Now regarding metabase I think it's a nice idea especially considering that a metabase instance already exists (managed to login myself with username "okurz"). I think we need to follow
and for that in my understanding we need an "Admin" on the instance.
vpelcak can you either make me/us admin or find an existing admin to add the corresponding openQA database access?
ok, thanks for your help and great to see your comment.
I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 which should ensure there is a postgres user "openqa" but I will need to check first if this works with salt and how to enable access from the outside
discussed with bzoltan1 and @VANASTASIADIS . Idea: Change query to limit to "last 365 days" and "maintenance job groups".
hypothesis: are more test modules failing in the last half a year than in before? I.e. ratio "failing / passed"
We need a rough answer for some data quick, i.e. "days". @VANASTASIADIS should hack up some python script to get median/mean/std for "qam tickets" in past year or so. And then in scope of "weeks" we work on connecting openQA db to metabase for future cool stuff :)
- Description updated (diff)
- Status changed from Workable to In Progress
- Assignee set to okurz
I have moved out the salt specific parts into a separate ticket as we do not have the postgres database setup covered with salt in general so far except for the "telegraf" roles we add, which is a good start though.
I will look into my MR again.
#13 Updated by VANASTASIADIS 2 months ago
For all tickets with [qam] or [qem] in the subject:
All time MEDIAN time open->close: 29.14 DAYS
Last year MEDIAN time open->close: 47.95 DAYS
All time MEAN time open->close: 86.88 DAYS
Last year MEAN time open->close: 124.67 DAYS
Done by calculating closing - creation date of a ticket.
I use the closing date of a ticket as the ticket's overall date, to decide whether it's in the last year or not.
I removed tickets with 0 time elapsed between opening and closing of a ticket, but I did not remove anything else. If you have any ideas on how to filter out more tickets, I'll implement that too.
#14 Updated by VANASTASIADIS 2 months ago
Some more detailed results:
Tickets that took 30 days were 591
Tickets that take >30 days are 49.25% of total functional closed tickets
Median of tickets that took <30 days: 5.78 days
Mean of tickets that took <30 days: 8.43 days
Tickets that took 30 days were 464
Tickets that take >30 days are 55.17% of total functional closed tickets
Median of tickets that took <30 days: 6.98 days
Mean of tickets that took <30 days: 9.43 days
- File metabase_openqa_suse_de_job_modules_overview_2020-11-19.png metabase_openqa_suse_de_job_modules_overview_2020-11-19.png added
Was tricky to get escaping right to modify postgresql.conf but tested successfully on lord.arch, my test instance, with:
salt-call --local file.replace /var/lib/pgsql/data/postgresql.conf pattern="(listen_addresses = ')[^']*('.*$)" repl='\1*\2'
did not find a parameter
test=True for salt-call so I also did a bit of
salt -C 'G@roles:webui' -l error --state-output=changes state.apply openqa.server test=True on osd.
Updated https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/395 to also change the postgres server config to allow access outside localhost. This is already applied to osd.
Asked gboiko in RC room #testing to create the database connection or give me access so that I can do that. But as there was no immediate answer I am experimenting with my own metabase instance, on c559.arch.suse.de (VM on lord.arch, Tumbleweed):
firewall-cmd --add-port 3000/tcp zypper -n in podman podman run -d -p 3000:3000 --name metabase metabase/metabase
then in browser one can connect to http://10.161.61.47:3000/ (DNS does not work, likely because IPv6 on lord.arch is incorrectly configured by me).
I configured postgres database access:
- host: openqa.suse.de
- DB name: openqa
- user: openqa
- password: openqa
I see that it would have actually been possible to use an SSH connection (or SSH tunnel even) but we do not need that anymore. I selected some conservative settings, e.g. no automatic syncs, schema sync only on demand, etc.
This is fun. After configuring the database metabase already shows some fancy stuff from the beginning, see attachment.
EDIT: 2020-11-19: Tried to enable LDAP authentication as well following https://confluence.suse.com/display/enginfra/pan.suse.de
EDIT: 2020-11-19: I wonder what other tables we can provide access to without making access too easy for data that one could misuse to derive personal performance information which we must avoid to not have problems with our workers council :)
- Due date set to 2020-12-03
- Status changed from In Progress to Feedback
DB access is possible and I setup a proof-of-concept metabase instance. I provided credentials to gboiko and he will inform me/us if the server he maintains is available again and ready for use for the mentioned use case.
The data is available on
please be aware that only "jobs" and "job_modules" are currently configured to be readable. Also be aware that with our current storage available for openqa.suse.de we are quite limited how long we can store results. Some "important" jobs are kept around for very long, e.g. milestone GA builds as well as jobs linked to bug reports if marked in openQA as such but general test results are reliable for a period of roughly 10 days only.
I wonder why it shows all tables as such "jobs" and "job_modules" are readable. Maybe we can hide some tables in there.
One difference is that I was using v0.37.2 of metabase, maintstats.dyn.cloud.suse.de is 0.36
So with this I see it as hopefully helpful what we could accomplish to answer the questions of "How often each test modules fails, which fail the most often" already in the beginning with giving everyone the possibility to use "metabase" in a convenient, pretty interface for further analysis. The question for "what are the trends of failure rate...." is likely better answered by what @VANASTASIADIS provided in #76903#note-14 or maybe by looking at the github repository of os-autoinst-distri-opensuse itself as we can not answer that directly easily from openQA test results from the live, production database.