action #109310

closed

QA (public) - coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

QA (public) - coordination #109641: [epic] qem-bot improvements

qem-bot/dashboard - mixed old and new incidents size:M

Added by osukup over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2022-03-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

Maintenance sometimes reuses old incidents instead of creating new ones for a package, which leads to mixed results in the dashboard :(

see: https://suse.slack.com/archives/C02D16TCP99/p1648721562205869

So we need a workaround or solution for this corner case.

See also https://github.com/openSUSE/qem-dashboard/issues/61

Originally brought up by coolo in
https://suse.slack.com/archives/C02D16TCP99/p1638283633141300

I just noticed a rather alarming issue: http://dashboard.qam.suse.de/incident/20989 talks about 43 passed, 1 failed jobs for the incident

Problems

Acceptance criteria

  • AC1: It is possible to reuse incidents and qem-bot can still approve related release requests

Suggestions

Workarounds

  • Ask maintenance to create a new, fresh incident, e.g. by a comment in IBS
  • Detect invalid requests, e.g. with outdated results, and reject them
  • Manually delete the old data

Something along the lines of

ssh root@qam2.suse.de
machinectl shell postgresql
sudo -u postgres psql dashboard_db
(wreak havoc in here)

SELECT update_settings FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
(store update_settings)

DELETE FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
DELETE FROM update_openqa_settings WHERE id IN (stored update_settings);

Related issues 5 (2 open, 3 closed)

Related to QA (public) - action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR (Resolved, osukup, 2021-12-08)
Related to QA (public) - action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M (Resolved, kraih, 2022-04-28)
Related to QA (public) - action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M (Resolved, kraih, 2022-07-26)
Related to QA (public) - action #155206: [qem-bot] re-release update can miss repo and thus not schedule updates (New, 2024-02-08)
Copied to openQA Project (public) - action #109974: qem-bot/dashboard - mixed old and new incidents - potential future ideas (New)
Actions #1

Updated by osukup over 2 years ago

  • Project changed from QA (public) to openQA Project (public)
Actions #2

Updated by okurz over 2 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #3

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #5

Updated by osukup over 2 years ago

The problem here is caused by using INCIDENT as the common identifier. Unfortunately, incidents are sometimes reused.

We use INCIDENT because we schedule jobs both in the Testing queue, which has an RRID ( SUSE:Maintenance:INCIDENT:ReviewRequest ), and in the Staging queue, which has no ReviewRequest ( SUSE:Maintenance:INCIDENT ). We want tests run in staging to remain valid after the RR is created and the incident is moved to the testing queue.

  • the most proper solution would be to create a new identifier
  • a simple workaround: automatic data deletion based on the age of the data (results older than roughly a month in aggregates are useless and can cause this problem)
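
To illustrate the identifier idea, here is a minimal Python sketch. It assumes a hypothetical JobResult record and a hypothetical relevant_results helper (neither exists in qem-bot or the dashboard) and made-up RR numbers 100 and 200. Results are matched on the incident plus the review request, while staging results (no RR) stay valid:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobResult:
    incident: int              # incident number, may be reused by maintenance
    rr_number: Optional[int]   # None for jobs scheduled from the staging queue
    status: str                # e.g. "passed" or "failed"


def relevant_results(results, incident, rr_number):
    """Return results of the incident that belong to the given RR or to staging."""
    return [
        r for r in results
        if r.incident == incident and r.rr_number in (None, rr_number)
    ]


results = [
    JobResult(20989, 100, "failed"),   # left over from an earlier review request
    JobResult(20989, None, "passed"),  # staging run without a review request
    JobResult(20989, 200, "passed"),   # current review request
]
current = relevant_results(results, incident=20989, rr_number=200)
```

In this sketch the failed job from the earlier RR no longer counts for the current one. Staging results from an earlier use of the same incident would still be accepted, so an age-based cleanup may be needed in addition.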
Actions #6

Updated by osukup over 2 years ago

We have a timestamp in the results, so we can pretty simply create a cron job which cleans up old results :D

Actions #7

Updated by livdywan over 2 years ago

  • Subject changed from qem-bot/dashboard - mixed old and new incidents to qem-bot/dashboard - mixed old and new incidents size:M
  • Description updated (diff)
  • Category changed from Regressions/Crashes to Support
  • Status changed from New to Workable
Actions #8

Updated by okurz over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #9

Updated by okurz over 2 years ago

  • Due date set to 2022-04-22
  • Status changed from In Progress to Feedback

I asked in https://suse.slack.com/archives/C02CCRM8946/p1649327149714419

Hi, I have a question regarding submissions for incidents which kinda "reuse" old incident numbers, see https://progress.opensuse.org/issues/109310 for details about the problem that this brings. Does anyone have objections (and suggestions) if we (automatically) reject submissions for old incidents that existed before? The effect would be that anyone creating new submissions would need to start new incidents (I hope I got the terms right here)

If there is no objection by 2022-04-22, I suggest we ask openQA test reviewers to reject such submissions and that we look into automatic rejection.

Actions #10

Updated by okurz over 2 years ago

  • Subject changed from qem-bot/dashboard - mixed old and new incidents size:M to qem-bot/dashboard - mixed old and new incidents
  • Description updated (diff)
  • Due date deleted (2022-04-22)
  • Category changed from Support to Feature requests
  • Status changed from Feedback to New
  • Assignee deleted (okurz)
  • Priority changed from Normal to Low

I asked maintenance experts in https://suse.slack.com/archives/C02CCRM8946/p1649327149714419 whether they are OK with us rejecting such submissions and asking for resubmission as new incidents with a unique number. By stating the "harshest" option first I managed to elicit some quite helpful answers :)

Hi, I have a question regarding submissions for incidents which kinda "reuse" old incident numbers, see https://progress.opensuse.org/issues/109310 for details about the problem that this brings. Does anyone have objections (and suggestions) if we (automatically) reject submissions for old incidents that existed before? The effect would be that anyone creating new submissions would need to start new incidents (I hope I got the terms right here)

Marina Latini and Simon Lees explaining when an incident is re-used:

Marina Latini: we are "reusing" old incidents only if really needed. we don't really use old parked incidents randomly. what you call reuse can be an incident with several resubmissions and where we had an initial declined/revoked RR for example.
we have also the case of re-releases of already released incidents and for those it's really wrong to create a new incident.
Simon Lees: the main place i've used them is if we release say SLE-15-SP3 with a regression but haven't released for older codestreams then generally we will fix the regression in the older codestreams in the original incident rather than creating a new one (obviously for streams that are released we create a new one)

Oliver Kurz and Stephan Kulow explaining that the impact of the issue is low so far:

Oliver Kurz: ok, I will see what we can do. @Stephan Kulow can you say why are we running into that problem reported now? I don't think that this is a change in maintenance processes. So it's either that qem-bot introduced a regression vs. older tooling or people have ignored the missing support for months/years
Stephan Kulow: As Marina mentioned, it's not done often - and we had this case in the past. But @Jozef Pupava just ignored old results (or they passed and there was nothing to ignore)

My suggestion and the one from Stephan Kulow:

Oliver Kurz: ok, so both you would say that incident_id+release_request_id should be enough to make it unique?
Stephan Kulow: The bot very well knows that the incident is bound to which RR - and if the RR changes, it needs to delete/invalidate the old data

So my open question: How easy would it be to implement that suggestion and where to start?

Actions #11

Updated by coolo over 2 years ago

the dashboard can trigger that cleanup when it gets new swamp data and notices an update of the RR.
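
A minimal sketch of that idea in Python, using a toy in-memory store. The DashboardStore class and its methods are hypothetical stand-ins, not the dashboard's actual code, and the RR numbers are made up. When incident data arrives with a different RR than the one recorded, the old job results for that incident are dropped:

```python
class DashboardStore:
    """Toy in-memory stand-in for the dashboard database (hypothetical)."""

    def __init__(self):
        self.rr_by_incident = {}    # incident number -> last seen rr_number
        self.jobs_by_incident = {}  # incident number -> list of job statuses

    def add_job(self, incident, status):
        self.jobs_by_incident.setdefault(incident, []).append(status)

    def sync_incident(self, incident, rr_number):
        """Record incident data and drop old jobs when the RR has changed.

        Returns the number of job results that were removed.
        """
        previous = self.rr_by_incident.get(incident)
        self.rr_by_incident[incident] = rr_number
        # A first sighting, or a move from staging (no RR) to testing, keeps results.
        if previous is None or previous == rr_number:
            return 0
        removed = self.jobs_by_incident.pop(incident, [])
        return len(removed)


store = DashboardStore()
store.sync_incident(20989, 100)
store.add_job(20989, "failed")
kept = store.sync_incident(20989, 100)     # same RR: nothing is removed
removed = store.sync_incident(20989, 200)  # RR changed: old results are dropped
```

The sketch keeps results when no RR was recorded before, on the assumption that staging results should survive the move to the testing queue.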

Actions #12

Updated by okurz over 2 years ago

  • Related to action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR added
Actions #13

Updated by okurz over 2 years ago

I see that there was already an attempt which looks like it intended to address the same issue: https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/46/diffs#14610756e1f5900260e8e8ecf7249d18a0fc7a5c_74_76

coolo wrote:

the dashboard can trigger that cleanup when it gets new swamp data and notices an update of the RR.

ok, sounds good.

Actions #14

Updated by okurz over 2 years ago

  • Parent task set to #109641
Actions #15

Updated by okurz over 2 years ago

  • Copied to action #109974: qem-bot/dashboard - mixed old and new incidents - potential future ideas added
Actions #16

Updated by okurz over 2 years ago

  • Subject changed from qem-bot/dashboard - mixed old and new incidents to qem-bot/dashboard - mixed old and new incidents size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #17

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #18

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #19

Updated by osukup over 2 years ago

It can be shortened to two DELETE queries:

DELETE FROM update_openqa_settings WHERE id IN (SELECT update_settings FROM openqa_jobs WHERE update_settings is not NULL AND updated < NOW() - INTERVAL '90 days');
DELETE FROM openqa_jobs WHERE update_settings is not NULL AND updated < NOW() - INTERVAL '90 days';
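
The effect of the two queries can be tried out with the following sketch. It uses Python's sqlite3 module with an in-memory database as a stand-in for PostgreSQL, so the date arithmetic is written as datetime('now', '-90 days') instead of NOW() - INTERVAL '90 days'. The two-table schema is a simplified assumption without foreign keys, not the real dashboard schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE update_openqa_settings (id INTEGER PRIMARY KEY);
    CREATE TABLE openqa_jobs (
        id INTEGER PRIMARY KEY,
        update_settings INTEGER,   -- NULL when the job has no update settings
        updated TEXT NOT NULL      -- timestamp of the last update
    );
    INSERT INTO update_openqa_settings (id) VALUES (1), (2);
    INSERT INTO openqa_jobs (id, update_settings, updated) VALUES
        (10, 1, datetime('now', '-120 days')),     -- stale job with settings
        (11, 2, datetime('now', '-5 days')),       -- recent job with settings
        (12, NULL, datetime('now', '-120 days'));  -- old job without settings
""")

# Assumed SQLite equivalent of the PostgreSQL condition used in the queries.
STALE = "update_settings IS NOT NULL AND updated < datetime('now', '-90 days')"

with conn:  # both deletes are committed together or rolled back together
    # Settings first: the subquery still needs the stale job rows.
    conn.execute(
        "DELETE FROM update_openqa_settings WHERE id IN "
        f"(SELECT update_settings FROM openqa_jobs WHERE {STALE})"
    )
    conn.execute(f"DELETE FROM openqa_jobs WHERE {STALE}")

remaining_jobs = [r[0] for r in conn.execute("SELECT id FROM openqa_jobs ORDER BY id")]
remaining_settings = [
    r[0] for r in conn.execute("SELECT id FROM update_openqa_settings ORDER BY id")
]
```

Only the stale job with update settings and its settings row are removed. If the real schema has a foreign key between the two tables, the order of the deletes or a cascade rule may need to be different.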
Actions #20

Updated by osukup over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to osukup
Actions #22

Updated by kraih over 2 years ago

  • Assignee changed from osukup to kraih

Stealing this ticket from Ondrej to keep an eye on it, since the proposed PR is now being deployed.

Actions #23

Updated by osukup over 2 years ago

kraih wrote:

Stealing this ticket from Ondrej to keep an eye on it, since the proposed PR is now being deployed.

You were one minute ahead of my own action to forward it to you and set it to Feedback. Your solution is beautiful.

Actions #24

Updated by kraih over 2 years ago

  • Status changed from In Progress to Feedback
Actions #25

Updated by kraih over 2 years ago

  • Related to action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M added
Actions #26

Updated by kraih over 2 years ago

  • Status changed from Feedback to Resolved

I think this ticket is resolved, but there is more work to be done, so I've made a follow-up ticket with more cleanup requirements that have come up in the meantime: #110409

Actions #27

Updated by jbaier_cz over 2 years ago

  • Related to action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M added
Actions #28

Updated by jbaier_cz 11 months ago

  • Related to action #155206: [qem-bot] re-release update can miss repo and thus not schedule updates added