action #109310
QA - coordination #91646 (closed): [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
QA - coordination #109641: [epic] qem-bot improvements
qem-bot/dashboard - mixed old and new incidents size:M
Description
Observation
Maintenance sometimes re-uses old incidents instead of creating new ones for a package, which leads to mixed results in the dashboard :(
see: https://suse.slack.com/archives/C02D16TCP99/p1648721562205869
So we need a workaround or solution for this corner case.
See also https://github.com/openSUSE/qem-dashboard/issues/61
Originally brought up by coolo in
https://suse.slack.com/archives/C02D16TCP99/p1638283633141300
I just noticed a rather alarming issue: http://dashboard.qam.suse.de/incident/20989 talks about 43 passed, 1 failed jobs for the incident
Problems
- http://dashboard.qam.suse.de/incident/20639 references "208 passed, 4 failed, 12 stopped" and a link to openQA results https://openqa.suse.de/tests/overview?build=%3A20639%3Aopensc but the openQA test results only show 183 passed and 18 soft-failed
- -> dashboard should not say "passed" when it means "passed+softfailed" but "ok", see https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Jobs/Constants.pm#L76=
- -> Consider using time-fixed links, e.g. https://openqa.suse.de/tests/overview?build=%3A20639%3Aopensc&t=2022-04-01+08%3A53%3A19+%2B0000
- -> Ensure that the results are current and correspond to what openQA sees itself (numbers should match)
- -> Exclude any results that are outside a "reasonable time range", e.g. http://dashboard.qam.suse.de/blocked for 20639 shows incident results from some months ago, build 2021…
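The first two points above can be sketched in a few lines of Python. This is only an illustration, not qem-dashboard code: the helper names `time_fixed_overview_url` and `summarize` are hypothetical, the "failed"/"stopped" bucketing is a simplification, and the set of "ok" results is taken from the passed+softfailed remark above.

```python
from urllib.parse import urlencode

# Results reported together as "ok" (passed + softfailed, as described above).
OK_RESULTS = {"passed", "softfailed"}


def time_fixed_overview_url(base_url, build, timestamp):
    """Return an openQA overview link pinned to a point in time via `t`."""
    query = urlencode({"build": build, "t": timestamp})
    return f"{base_url}/tests/overview?{query}"


def summarize(results):
    """Count job results, reporting passed and softfailed together as "ok"."""
    summary = {"ok": 0, "failed": 0, "stopped": 0}
    for result in results:
        if result in OK_RESULTS:
            summary["ok"] += 1
        elif result == "failed":
            summary["failed"] += 1
        else:
            # Simplification: everything else is counted as "stopped".
            summary["stopped"] += 1
    return summary
```

Calling `time_fixed_overview_url("https://openqa.suse.de", ":20639:opensc", "2022-04-01 08:53:19 +0000")` reproduces the time-fixed link quoted above.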
Acceptance criteria
- AC1: It is possible to reuse incidents and qem-bot can still approve related release requests
Suggestions
- Read the qem-dashboard schema to understand where important settings are stored in https://github.com/openSUSE/qem-dashboard/ , in particular https://github.com/openSUSE/qem-dashboard/blob/main/migrations/dashboard.sql
- Read the manual process described under "Workarounds" (further down), both to use it as a workaround and for us to understand the problem
- Just delete all aggregate openQA data in qem-dashboard older than a configurable age, defaulting to 90 days
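The age-based cleanup from the last suggestion could look roughly like the following sketch. It uses an in-memory SQLite database with a heavily simplified, assumed schema so that it runs anywhere; the real dashboard uses PostgreSQL and the schema in `migrations/dashboard.sql`. The function name, the `job_id` column and the ISO-string timestamps are assumptions made for illustration. Unlike the manual workaround, the sketch only removes settings rows that no remaining job references.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

DEFAULT_MAX_AGE_DAYS = 90  # default from the suggestion; meant to be configurable


def delete_old_aggregate_data(conn, now, max_age_days=DEFAULT_MAX_AGE_DAYS):
    """Delete aggregate jobs older than the cutoff, then their orphaned settings."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    with conn:  # single transaction: commit on success, roll back on error
        rows = conn.execute(
            "SELECT DISTINCT update_settings FROM openqa_jobs "
            "WHERE update_settings IS NOT NULL AND updated < ?",
            (cutoff,),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.execute(
            "DELETE FROM openqa_jobs "
            "WHERE update_settings IS NOT NULL AND updated < ?",
            (cutoff,),
        )
        # Only drop settings that no remaining job still points to.
        conn.executemany(
            "DELETE FROM update_openqa_settings WHERE id = ? AND id NOT IN "
            "(SELECT update_settings FROM openqa_jobs "
            "WHERE update_settings IS NOT NULL)",
            [(i,) for i in ids],
        )
    return ids


def demo():
    """Build a tiny assumed schema, run the cleanup, return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE update_openqa_settings (id INTEGER PRIMARY KEY);"
        "CREATE TABLE openqa_jobs ("
        "job_id INTEGER PRIMARY KEY, update_settings INTEGER, updated TEXT);"
    )
    now = datetime(2022, 4, 1, tzinfo=timezone.utc)
    old = (now - timedelta(days=200)).isoformat()
    recent = (now - timedelta(days=5)).isoformat()
    conn.executemany("INSERT INTO update_openqa_settings VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO openqa_jobs VALUES (?, ?, ?)",
        [(1, 1, old), (2, 2, recent), (3, None, old)],
    )
    deleted = delete_old_aggregate_data(conn, now)
    jobs = [r[0] for r in conn.execute("SELECT job_id FROM openqa_jobs ORDER BY job_id")]
    settings = [r[0] for r in conn.execute("SELECT id FROM update_openqa_settings")]
    return deleted, jobs, settings
```

In `demo()`, the old aggregate job and its settings row are removed, while the recent aggregate job and the old job without `update_settings` are left alone.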
Workarounds
- Ask maintenance to create a new, fresh incident, e.g. by a comment in IBS
- Detect invalid requests, e.g. with outdated results, and reject them
- Manually delete the outdated data, something along the lines of:
ssh root@qam2.suse.de
machinectl shell postgresql
sudo -u postgres psql dashboard_db
(wreak havoc in here)
SELECT update_settings FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
(store update_settings)
DELETE FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
DELETE FROM update_openqa_settings WHERE id IN (stored update_settings);