action #109310
QA - coordination #91646 (closed): [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
QA - coordination #109641: [epic] qem-bot improvements
qem-bot/dashboard - mixed old and new incidents size:M
Description
Observation
Maintenance sometimes re-uses old incidents instead of creating new ones for a package, which leads to mixed results in the dashboard :(
see: https://suse.slack.com/archives/C02D16TCP99/p1648721562205869
So we need a workaround/solution for this corner case
See also https://github.com/openSUSE/qem-dashboard/issues/61
Originally brought up by coolo in
https://suse.slack.com/archives/C02D16TCP99/p1638283633141300
I just noticed a rather alarming issue: http://dashboard.qam.suse.de/incident/20989 talks about 43 passed, 1 failed jobs for the incident
Problems
- http://dashboard.qam.suse.de/incident/20639 references "208 passed, 4 failed, 12 stopped" and a link to openQA results https://openqa.suse.de/tests/overview?build=%3A20639%3Aopensc but the openQA test results only show 183 passed and 18 soft-failed
- -> dashboard should not say "passed" when it means "passed+softfailed" but "ok", see https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Jobs/Constants.pm#L76=
- -> Consider using time-fixed links, e.g. https://openqa.suse.de/tests/overview?build=%3A20639%3Aopensc&t=2022-04-01+08%3A53%3A19+%2B0000
- -> Ensure that the results are current and correspond to what openQA sees itself (numbers should match)
- -> Exclude any results that are outside a "reasonable time range", e.g. http://dashboard.qam.suse.de/blocked for 20639 shows incident results from some months ago, build 2021…
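The points above about "ok" versus "passed" and about a reasonable time range could be sketched roughly as follows. This is only an illustration: the `summarize` function, the job dictionaries with `result` and `updated` keys, and the 90-day default are assumptions, not the dashboard's actual code. The only fact taken from openQA is that its `OK_RESULTS` constant groups `passed` and `softfailed`.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# openQA counts both of these results as "ok" (OK_RESULTS in OpenQA::Jobs::Constants).
OK_RESULTS = {"passed", "softfailed"}


def summarize(jobs, now, max_age=timedelta(days=90)):
    """Count job results, ignoring jobs last updated more than max_age ago.

    `jobs` is a list of dicts with the hypothetical keys "result" (str)
    and "updated" (timezone-aware datetime).
    """
    recent = [job for job in jobs if now - job["updated"] <= max_age]
    counts = Counter(job["result"] for job in recent)
    ok = sum(counts[result] for result in OK_RESULTS)
    return {
        "ok": ok,                             # passed + softfailed
        "other": len(recent) - ok,            # failed, stopped, running, ...
        "ignored": len(jobs) - len(recent),   # outside the time range
    }
```

Reporting "ok" rather than "passed" keeps the dashboard number comparable with what the openQA overview page shows, and the "ignored" count makes it visible when stale results were dropped.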
Acceptance criteria
- AC1: It is possible to reuse incidents and qem-bot can still approve related release requests
Suggestions
- Read the qem-dashboard schema to understand where important settings are stored in https://github.com/openSUSE/qem-dashboard/ , in particular https://github.com/openSUSE/qem-dashboard/blob/main/migrations/dashboard.sql
- Read the manual process described under "Workarounds" (further down), both as a workaround and for us to understand the problem
- Just delete all aggregate openQA data in qem-dashboard older than a configurable age, defaulting to 90 days
Workarounds
- Ask maintenance to create a new, fresh incident, e.g. by a comment in IBS
- Detect invalid requests, e.g. with outdated results, and reject them
- Manually delete
Something along the lines of
ssh root@qam2.suse.de
machinectl shell postgresql
sudo -u postgres psql dashboard_db
(wreak havoc in here)
SELECT update_settings FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
(store update_settings)
DELETE FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
DELETE FROM update_openqa_settings WHERE id IN (stored update_settings);
Updated by okurz over 2 years ago
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by osukup over 2 years ago
reported also as https://github.com/openSUSE/qem-dashboard/issues/61
Updated by osukup over 2 years ago
The problem here is caused by using INCIDENT as the common identifier. Unfortunately, it is sometimes reused.
We use INCIDENT because we schedule jobs in the Testing queue with an RRiD ( SUSE:Maintenance:INCIDENT:ReviewRequest ) and also in the Staging queue, which has no ReviewRequest ( SUSE:Maintenance:INCIDENT ), and we want tests in staging to remain valid after the RR is created and the incident is moved to the testing queue.
- the most proper solution would be to create a new identifier
- simple workaround: automatic data deletion based on the age of the data (results older than roughly a month in aggregates are useless and can cause this problem)
Updated by osukup over 2 years ago
we have a timestamp in the results ... so we can pretty simply create a cron job which cleans up old results :D
Updated by livdywan over 2 years ago
- Subject changed from qem-bot/dashboard - mixed old and new incidents to qem-bot/dashboard - mixed old and new incidents size:M
- Description updated (diff)
- Category changed from Regressions/Crashes to Support
- Status changed from New to Workable
Updated by okurz over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by okurz over 2 years ago
- Due date set to 2022-04-22
- Status changed from In Progress to Feedback
I asked in https://suse.slack.com/archives/C02CCRM8946/p1649327149714419
Hi, I have a question regarding submissions for incidents which kinda "reuse" old incident numbers, see https://progress.opensuse.org/issues/109310 for details about the problem that this brings. Does anyone have objections (and suggestions) if we (automatically) reject submissions for old incidents that existed before? The effect would be that anyone creating new submissions would need to start new incidents (I hope I got the terms right here)
If there is no objection until 2022-04-22 then I suggest we ask openQA test reviewers to reject such submissions and look into automatic rejection.
Updated by okurz over 2 years ago
- Subject changed from qem-bot/dashboard - mixed old and new incidents size:M to qem-bot/dashboard - mixed old and new incidents
- Description updated (diff)
- Due date deleted (2022-04-22)
- Category changed from Support to Feature requests
- Status changed from Feedback to New
- Assignee deleted (okurz)
- Priority changed from Normal to Low
I asked maintenance experts in https://suse.slack.com/archives/C02CCRM8946/p1649327149714419 if they are ok if we reject such submissions and ask for resubmission as new incidents with a unique number. And with me stating the "harshest" option first I could spawn some quite helpful answers :)
Hi, I have a question regarding submissions for incidents which kinda "reuse" old incident numbers, see https://progress.opensuse.org/issues/109310 for details about the problem that this brings. Does anyone have objections (and suggestions) if we (automatically) reject submissions for old incidents that existed before? The effect would be that anyone creating new submissions would need to start new incidents (I hope I got the terms right here)
Marina Latini and Simon Lees explaining when an incident is re-used:
Marina Latini: we are "reusing" old incidents only if really needed. we don't really use old parked incidents randomly. what you call reuse can be an incident with several resubmissions and where we had an initial declined/revoked RR for example.
we have also the case of re-releases of already released incidents and for those it's really wrong to create a new incident.
Simon Lees: the main place I've used them is if we release say SLE-15-SP3 with a regression but haven't released for older codestreams, then generally we will fix the regression in the older codestreams in the original incident rather than creating a new one (obviously for streams that are released we create a new one)
Oliver Kurz and Stephan Kulow: explaining that the impact of the issue is low so far:
Oliver Kurz: ok, I will see what we can do. @Stephan Kulow can you say why are we running into that problem reported now? I don't think that this is a change in maintenance processes. So it's either that qem-bot introduced a regression vs. older tooling or people have ignored the missing support for months/years
Stephan Kulow: As Marina mentioned, it's not done often - and we had this case in the past. But @Jozef Pupava just ignored old results (or they passed and there was nothing to ignore)
My suggestion and from Stephan Kulow:
Oliver Kurz: ok, so both of you would say that incident_id+release_request_id should be enough to make it unique?
Stephan Kulow: The bot very well knows that the incident is bound to which RR - and if the RR changes, it needs to delete/invalidate the old data
So my open question: How easy would it be to implement that suggestion and where to start?
Updated by coolo over 2 years ago
the dashboard can trigger that cleanup when it gets new swamp data and notices an update of the RR.
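A minimal sketch of that idea, assuming the dashboard keeps the last seen RR number per incident: when new data arrives with a different RR, the stored job results for that incident are dropped. The function name and the two in-memory dictionaries are hypothetical stand-ins for the real database tables. An incident that had no RR before (staging) keeps its results when it first gets one, matching the requirement that staging tests stay valid after the RR is created.

```python
def apply_incident_update(known_rr, jobs_by_incident, incident, new_rr):
    """Record new_rr for incident and drop its stored jobs if the RR changed.

    known_rr: dict mapping incident number -> last seen RR number (or None)
    jobs_by_incident: dict mapping incident number -> list of stored job results
    Returns True when old results were invalidated.
    """
    old_rr = known_rr.get(incident)
    known_rr[incident] = new_rr
    # Only a change from one RR to a different RR invalidates old results;
    # going from "no RR yet" to a first RR keeps the staging results.
    if old_rr is not None and old_rr != new_rr:
        jobs_by_incident.pop(incident, None)
        return True
    return False
```

In the real dashboard the same check would presumably run where incident data is synced, with the dictionary operations replaced by an update and a delete in one transaction.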
Updated by okurz over 2 years ago
- Related to action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR added
Updated by okurz over 2 years ago
I see that there was already an attempt which looks like it intended to address the same issue: https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/46/diffs#14610756e1f5900260e8e8ecf7249d18a0fc7a5c_74_76
coolo wrote:
the dashboard can trigger that cleanup when it gets new swamp data and notices an update of the RR.
ok, sounds good.
Updated by okurz over 2 years ago
- Copied to action #109974: qem-bot/dashboard - mixed old and new incidents - potential future ideas added
Updated by okurz over 2 years ago
- Subject changed from qem-bot/dashboard - mixed old and new incidents to qem-bot/dashboard - mixed old and new incidents size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by osukup over 2 years ago
it can be shortened to 2x DELETE queries:
DELETE FROM update_openqa_settings WHERE id IN (SELECT update_settings FROM openqa_jobs WHERE update_settings is not NULL AND updated < NOW() - INTERVAL '90 days');
DELETE FROM openqa_jobs WHERE update_settings is not NULL AND updated < NOW() - INTERVAL '90 days';
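If the 90 days should be configurable, as suggested in the description, the two statements above could be generated with the cutoff passed as a query parameter. This is a sketch under assumptions: `cleanup_statements` is a hypothetical helper, and the `%s` placeholder style assumes a DB-API driver such as psycopg2. The table and column names are the ones from the queries above.

```python
from datetime import datetime, timedelta, timezone


def cleanup_statements(now, retention_days=90):
    """Return (sql, params) pairs deleting aggregate openQA data older than the cutoff.

    The settings rows are deleted first, while the stale openqa_jobs rows
    that reference them can still be selected.
    """
    cutoff = now - timedelta(days=retention_days)
    stale = "update_settings IS NOT NULL AND updated < %s"
    return [
        (
            "DELETE FROM update_openqa_settings WHERE id IN "
            f"(SELECT update_settings FROM openqa_jobs WHERE {stale})",
            (cutoff,),
        ),
        (f"DELETE FROM openqa_jobs WHERE {stale}", (cutoff,)),
    ]
```

Each pair could then be passed to `cursor.execute(sql, params)`, ideally inside a single transaction, e.g. from a cron job.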
Updated by osukup over 2 years ago
- Status changed from Workable to In Progress
- Assignee set to osukup
Updated by kraih over 2 years ago
Opened a PR: https://github.com/openSUSE/qem-dashboard/pull/63
Updated by kraih over 2 years ago
- Assignee changed from osukup to kraih
Stealing this ticket from Ondrej to keep an eye on it, since the proposed PR is now being deployed.
Updated by osukup over 2 years ago
kraih wrote:
Stealing this ticket from Ondrej to keep an eye on it, since the proposed PR is now being deployed.
one minute before my own action to forward it to you and set it to Feedback. Your solution is beautiful
Updated by kraih over 2 years ago
- Related to action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M added
Updated by kraih over 2 years ago
- Status changed from Feedback to Resolved
I think this ticket is resolved, but there is more work to be done, so I've made a follow-up ticket with more cleanup requirements that have come up in the meantime: #110409
Updated by jbaier_cz over 2 years ago
- Related to action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M added
Updated by jbaier_cz 10 months ago
- Related to action #155206: [qem-bot] re-release update can miss repo and thus not schedule updates added